Structure of COBUILD dictionary files
(Written while reading the script, so there is a fair chance that some of this is wrong.)
* Files
=======
- CD-ROM editions:
1. Book 3rd edition
1.1 Collins COBUILD ON CD-ROM ISBN:0-00-710884-2 ('Version 1.0 Software')
1.2 Collins COBUILD ON CD-ROM ISBN:0-00-715905-6 ('Version 2.0 Software')
2. Book 4th edition ('Version 3.0 Software')
2.1 Advanced Learner's English Dictionary + CD-ROM ISBN:0-00-715799-1 Hardback
2.2 Advanced Learner's English Dictionary + CD-ROM ISBN:0-00-715800-9 Paperback
2.3 Collins Cobuild on CD-ROM Resource Pack ISBN:0-00-716921-3
The CD-ROM contents of 2.1 and 2.2 are thought to be identical.
2.1/2.2 contain only the dictionary itself and the Wordbank.
The other editions also include the grammar, the thesaurus and the usage.
The dictionary file itself differs slightly between 2.1/2.2 and 2.3.
- Version 1
In the installation directory:
EN MBX 244,901 01-12-11 11:38 en.mbx pictures
EN SPL 524,073 01-12-11 11:38 en.spl spellings
EN WBX 549,351,419 01-12-11 11:40 en.wbx pronunciations
EN-CC3 FTX 17,685,849 01-12-11 11:37 en-cc3.ftx
EN-CC3 REL 266,056 01-12-11 11:37 en-cc3.rel
EN-CC3 TRD 15,191,804 01-12-11 11:37 en-cc3.trd dictionary
EN-CGR FTX 1,393,929 01-12-11 11:37 en-cgr.ftx
EN-CGR TRD 1,158,940 01-12-11 11:37 en-cgr.trd grammar
EN-CTH FTX 1,331,337 01-12-11 11:37 en-cth.ftx
EN-CTH TRD 1,590,516 01-12-11 11:37 en-cth.trd thesaurus
EN-CUS FTX 2,473,533 01-12-11 11:37 en-cus.ftx
EN-CUS TRD 1,645,144 01-12-11 11:37 en-cus.trd usage
EN-CWB FTX 37,906,681 01-12-11 11:38 en-cwb.ftx
EN-CWB TRD 25,367,548 01-12-11 11:37 en-cwb.trd Wordbank
- Version 2
In the 'data' directory under the installation directory:
EN SPL 523,273 02-09-04 4:00 en.spl spellings
HCP-CC3 DDF 3,253 02-09-04 4:00 hcp-cc3.ddf
HCP-USR DDF 224 02-09-04 4:00 hcp-usr.ddf
HCP_EN WBX 549,351,419 02-09-04 4:00 hcp_en.wbx pronunciations
HCP_EN~1 FTX 22,913,217 02-09-04 4:00 hcp_en-cc3.ftx
HCP_EN~1 MBX 244,901 02-09-04 4:00 hcp_en-cc3.mbx pictures
HCP_EN~1 REL 334,395 02-09-04 4:00 hcp_en-cc3.rel
HCP_EN~1 TRD 15,055,464 02-09-04 4:00 hcp_en-cc3.trd dictionary
HCP_EN~2 FTX 1,634,201 02-09-04 4:00 hcp_en-grm.ftx
HCP_EN~2 REL 5,121 02-09-04 4:00 hcp_en-grm.rel
HCP_EN~2 TRD 997,520 02-09-04 4:00 hcp_en-grm.trd grammar
HCP_EN~3 FTX 1,576,417 02-09-04 4:00 hcp_en-gth.ftx
HCP_EN~3 TRD 1,467,456 02-09-04 4:00 hcp_en-gth.trd thesaurus
HCP_EN~4 FTX 2,978,453 02-09-04 4:00 hcp_en-usg.ftx
HCP_EN~4 TRD 1,512,784 02-09-04 4:00 hcp_en-usg.trd usage
HCP_EN~5 FTX 35,691,873 02-09-04 4:00 hcp_en-wbk.ftx
HCP_EN~5 TRD 23,348,888 02-09-04 4:00 hcp_en-wbk.trd Wordbank
- Version 3
In the 'data' directory under the installation directory:
523273 May 23 2003 en.spl spellings
3745 May 23 2003 hcp-cc3.ddf
224 May 23 2003 hcp-usr.ddf
19121229 May 23 2003 hcp_en-cc3.ftx
244901 May 23 2003 hcp_en-cc3.mbx pictures
314898 May 23 2003 hcp_en-cc3.rel
13251280 May 23 2003 hcp_en-cc3.trd dictionary
35691873 May 23 2003 hcp_en-wbk.ftx
23348888 May 23 2003 hcp_en-wbk.trd Wordbank
549351419 May 23 2003 hcp_en.wbx pronunciations
20150 May 23 2003 lexsmb.mbx
The Resource Pack also contains the grammar, thesaurus and usage files.
*.trd Text data
*.ftx Indexes for fulltext search
*.rel Relation file
*.wbx Pronunciation data
*.mbx Graphics data
*.spl Spelling data
*.ddf Dictionary description file
* Structure of dictionary file
==============================
The structure of a *.trd file is as follows:
Header section
Unknown (not investigated)
Index section
Data section
- Header section
The length is fixed at 128 bytes (80h).
All numeric values are little-endian.
Offset Length Description
0h     64     Copyright notice (the remainder is filled with 00h)
40h    16     Unknown
50h    4      Number of data records
54h    4      Unknown
58h    4      Number of index bases
5Ch    4      Number of index offsets (number of data records + 1)
60h    4      Unknown
64h    4      Index region position
68h    4      Data region position
6Ch    4      Unknown
70h    16     Unknown
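
The header layout above could be read along the following lines. This is only a sketch: the function name `parse_trd_header` and the dictionary keys are made up for illustration, and the 4-byte fields are assumed to be unsigned integers. The unknown fields are read and discarded.

```python
import struct


def parse_trd_header(buf):
    """Parse the 128-byte header of a *.trd file (hypothetical helper)."""
    if len(buf) < 0x80:
        raise ValueError("the header is 128 bytes long")
    # 0h: copyright notice, padded with 00h up to 64 bytes.
    notice = buf[:0x40].rstrip(b"\x00").decode("latin-1")
    # 50h..6Fh: eight little-endian 32-bit values (assumed unsigned).
    (n_records, _unk54, n_bases, n_offsets,
     _unk60, index_pos, data_pos, _unk6c) = struct.unpack_from("<8I", buf, 0x50)
    return {
        "notice": notice,
        "records": n_records,
        "index_bases": n_bases,
        "index_offsets": n_offsets,
        "index_pos": index_pos,
        "data_pos": data_pos,
    }
```

If the notes are right, `index_offsets` should always equal `records + 1`, which makes a cheap sanity check after parsing.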
- Index section
It is stored in the following layout:
base 1
base 2
...
base n
ofs 1..64 (for base 1)
ofs 1..64 (for base 2)
...
ofs 1..x (for base n)
First come the index bases, as many as the number of index bases.
Length
4 Index base (offset from the start of the data section)
Then, for each index base, the index offsets follow in order, 64 per base.
The last index base may have fewer than 64 index offsets.
Length
2 Index offset (offset from the index base, in units of 4 bytes)
The position of a data record is calculated with the following formula:
Data record position = data region position
+ Index base
+ (Index offset * 4)
Each data record extends up to the position of the next data record.
The number of index offsets is the number of data records + 1;
in other words, the last index offset exists only to give the length of the last record.
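
The index lookup can be sketched as follows. The function name `record_positions` is hypothetical, and two things are assumed rather than stated in these notes: that the region positions are absolute file offsets, and that the 2-byte index offsets are unsigned and follow the index bases immediately.

```python
import struct


def record_positions(buf, index_pos, data_pos, n_bases, n_offsets):
    """Return the file position of every data record boundary."""
    # Index bases: n_bases little-endian 32-bit values.
    bases = struct.unpack_from("<%dI" % n_bases, buf, index_pos)
    # Index offsets: n_offsets little-endian 16-bit values, 64 per base.
    offsets = struct.unpack_from("<%dH" % n_offsets, buf, index_pos + 4 * n_bases)
    # position = data region position + index base + (index offset * 4)
    return [data_pos + bases[i // 64] + offsets[i] * 4 for i in range(n_offsets)]
```

With the returned list `pos`, record `i` would occupy `buf[pos[i]:pos[i + 1]]`; the final element only marks the end of the last record.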
- Data section
Records of the following form are stored consecutively.
Length
1 Number of text items
- Headword item
- Text item 1
...
- Text item N
Each item consists of a bit flag (1 byte) showing which data elements are present,
followed by the actual data.
Character data is stored as a bitstream of 6-bit units.
In what follows, data is character data unless stated otherwise.
== Structure of headword item
Flag
0x01 identification section
  0x01 entry word
  0x02 suspicious look / key? (thesaurus and grammar only)
  0x08 part of speech (thesaurus only)
  0x10 inflected form [ CD-ROM v3.0 ]
  0x20 usage (thesaurus only)
  0x80 pronunciation
0x02 additional data
  0x01 reason for being a separate headword
  0x02 sentence structure [ Resource Pack ]
  0x04 hyphenation [ CD-ROM v3.0 ]
  0x08 notes
  0x10 chapter/section number (grammar only)
  0x20 frequency
       The data is a 1-byte number giving the frequency band
       (1..5; 1..3 in CD-ROM v3.0)
0x80 audio data
  Entries of the following form are stored consecutively.
  01h: data present (3 bytes of data follow; the first entry?)
  80h: data present (3 bytes of data follow; an entry other than the first?)
  00h: end of data
== Structure of text item
Flag
0x01 sub-headword
  0x01 headword
  0x02 ? alternate headword [ CD-ROM v3.0 ] (only one occurrence, in "faff"; an editing mistake?)
  0x08 part of speech
  0x10 derived form
  0x20 explanation of the whole headword [ Resource Pack ]
0x02 usage etc.
  0x02 suspicious look
  0x04 hyphenation [ CD-ROM v3.0 ]
  0x08 notes (not displayed in the thesaurus only)
  0x10 usage (chapter/section number in the grammar)
  0x20 ? (dictionary itself only)
       The data is 1 byte with a value of 01h..05h (frequency?)
  0x80 number or symbol
0x10 text
  0x01 meaning
  0x02 example
       The first byte of the data is the number of examples. The example data follows.
  0x04 table
       The first byte of the data is the number of rows. The data of each row follows.
  0x08 reference
       The first byte of the data is the number of references. Each reference follows.
  0x40 reference (See also) [ CD-ROM v3.0 ]
  0x80 picture
       The first byte of the data is a flag. The next 4 bytes are the picture number.
0x40 related words
  0x01 synonym
  0x02 antonym
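
As a small illustration of how the top-level flag bytes of the two item types break down, the following sketch maps set bits to names. The table and function names are invented here; only the bit values come from the lists above.

```python
# Top-level flag bits of a headword item and of a text item.
HEADWORD_FLAGS = {0x01: "identification", 0x02: "additional data", 0x80: "audio data"}
TEXT_FLAGS = {0x01: "sub-headword", 0x02: "usage etc.", 0x10: "text", 0x40: "related words"}


def flag_names(flag, table):
    """List the names of the bits set in flag, lowest bit first."""
    names = [name for bit, name in sorted(table.items()) if flag & bit]
    known = 0
    for bit in table:
        known |= bit
    unknown = flag & ~known & 0xFF
    if unknown:
        # Bits that these notes do not describe are reported, not dropped.
        names.append("unknown 0x%02X" % unknown)
    return names
```

For the flag bytes that appear in the sample record later in these notes, `flag_names(0x81, HEADWORD_FLAGS)` gives `['identification', 'audio data']` and `flag_names(0x13, TEXT_FLAGS)` gives `['sub-headword', 'usage etc.', 'text']`.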
* Character data
================
Character data is a bitstream of 6-bit units, packed into bytes.
Frequent characters take 6 bits, ordinary characters take 12 bits, and external characters take 30 bits.
The end of character data is marked by 000000b.
Value of the first 6 bits
0       End of string
1..26   Lowercase English letters (a..z)
32..38  Common symbols
40..62  Combines with the following unit to form one character (uppercase English letters, digits, symbols, accented letters, etc.)
63      External character (each of the next 4 units, minus 1, gives 4 bits of the Unicode value)
For details, refer to CobuidLib.rb.
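
A partial decoder might look like the sketch below. The names `units6` and `decode_text` are hypothetical. The symbol values in `SIGNS`, the uppercase rule for prefix 61 and the two entries in `PREFIX_62` are not taken from any table; they were inferred by hand from the sample record shown later in these notes, so anything else decodes to `?`. The nibble order for external characters (most significant first) is also an assumption.

```python
# Symbol units; values inferred from the sample record, not from a full table.
SIGNS = {32: " ", 33: ".", 34: "<", 35: ">", 36: ",", 38: "-"}
# Two-unit characters with prefix 62; only these two pairs were observed.
PREFIX_62 = {7: "'", 17: "1"}


def units6(data):
    """Yield the 6-bit units of data, most significant bit first."""
    acc = 0
    nbits = 0
    for byte in data:
        acc = (acc << 8) | byte
        nbits += 8
        while nbits >= 6:
            nbits -= 6
            yield (acc >> nbits) & 0x3F
            acc &= (1 << nbits) - 1


def decode_text(data):
    """Return (decoded string, number of bytes consumed)."""
    units = list(units6(data))
    out = []
    i = 0
    while i < len(units):
        u = units[i]
        i += 1
        if u == 0:                      # end of string
            break
        if 1 <= u <= 26:                # a..z
            out.append(chr(ord("a") + u - 1))
        elif u == 63:                   # external character: 4 nibbles, each stored +1
            code = 0
            for v in units[i:i + 4]:
                code = (code << 4) | ((v - 1) & 0xF)
            i += 4
            out.append(chr(code))
        elif 40 <= u <= 62:             # two-unit character
            v = units[i] if i < len(units) else 0
            i += 1
            if u == 61 and 32 <= v <= 57:
                out.append(chr(ord("A") + v - 32))   # inferred: A..Z
            else:
                out.append(PREFIX_62.get(v, "?") if u == 62 else "?")
        else:
            out.append(SIGNS.get(u, "?"))
    return "".join(out), (i * 6 + 7) // 8
```

The byte count is returned because the field that follows a string starts on the next byte boundary after the terminating unit. The sample bytes `F6 09 20 04 00` decode to `A, a` (5 bytes) and `F6 D9 BD D7 D8 3d C4 00` to `N-VAR` (8 bytes).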
Tags are used inside character data.
- <Tag.string> form (with examples)
B bold [ CD-ROM v3.0 ]
E emphasis
I italic
U underline? (is it used?)
C note (comment) (informal, Brit, ...)
F phrase?
X cross reference
G grammatical element? (grammar?) (of, in, see also, ...)
O option? pronoun? (your, someone, ...)
V ? (see, one's, ...) [ CD-ROM v3.0 ]
W ? (WEAK, STRONG, ...) [ CD-ROM v3.0 ]
A American
B British
O option? (to, with, someone, ...)
E exponent? (superscript number)
S symbol [ CD-ROM v3.0 ]
- <Tag> form (with examples)
H hyphenation point [ CD-ROM v3.0 ]
DW warning
Li table item (list item)
Lb table start (list begin)
Le table end (list end)
Z[1-9]?[a-z] pronunciation marker ..
A[a-z] American pronunciation marker [ CD-ROM v3.0 ]
B[a-z] British pronunciation marker [ CD-ROM v3.0 ]
Format inside a table:
......
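
Once a string has been decoded, the two tag forms can be separated from the plain text with a regular expression. The sketch below is illustrative only: `TAG_RE` and `split_tags` are made-up names, and the pattern assumes that a tag name is alphanumeric and that a period separates it from the tagged string.

```python
import re

# <Tag> or <Tag.string>; the tag name is assumed to be alphanumeric.
TAG_RE = re.compile(r"<([A-Za-z][A-Za-z0-9]*)(?:\.([^<>]*))?>")


def split_tags(text):
    """Split text into (tag, string) pairs; plain text has tag None."""
    parts = []
    pos = 0
    for m in TAG_RE.finditer(text):
        if m.start() > pos:
            parts.append((None, text[pos:m.start()]))
        # group(2) is None for the bare <Tag> form.
        parts.append((m.group(1), m.group(2)))
        pos = m.end()
    if pos < len(text):
        parts.append((None, text[pos:]))
    return parts
```

For example, `split_tags("<E.A> is the first letter")` returns `[('E', 'A'), (None, ' is the first letter')]`. The pattern accepts tag names in either case.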
* Example of data
=================
The entry for "A" is used as an example. Its data is as follows.
05 81 01 F6 09 20 04 00 01 B3 FB 00 80 B4 FB 00
00 13 18 F6 D9 BD D7 D8 3d C4 00 F6 09 20 06 26
81 8e 0F 60 F8 74 E4 80 1f 87 4E 26 82 8c 00 80
F9 10 00 01 88 58 7d 82 38 09 4E 05 08 16 01 89
49 35 20 30 55 14 15 28 0F 1a 05 08 16 0F 64 38
73 09 4c 88 01 31 02 01 08 55 21 00 13 18 F6 D9
...
0. Number of text items
The first byte, 05, is the number of text items; in other words, there are 5 text items.
1. Headword item
The next byte, 81, is the flag of the headword item.
It shows that the identification section (0x01) and the audio data (0x80) are present.
1.1 Identification section
The leading 01 is the flag of the identification section. The data consists of the entry word only (0x01).
1.1.1 Entry word
F6 09 20 04 00
Written in binary, this is
11110110 00001001 00100000 00000100 00000000
Split into 6-bit units, the data ends at the null unit (000000b).
111101 100000 100100 100000 000001 000000 0000
A, a
1.2 Audio data
01 B3 FB 00 80 B4 FB 00 00
This can be analyzed as follows.
01 data present (the first entry?)
B3 FB 00 audio data number (?)
80 data present (an intermediate entry?)
B4 FB 00 audio data number (?)
00 end of data
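
The audio data list can be walked as in the sketch below. The name `parse_audio_list` is hypothetical, and reading the 3 data bytes as a little-endian number is an assumption: these notes only say that header values are little-endian and mark the meaning of the 3 bytes as uncertain.

```python
def parse_audio_list(data):
    """Return ([(marker, number), ...], number of bytes consumed)."""
    entries = []
    pos = 0
    while pos < len(data):
        marker = data[pos]
        pos += 1
        if marker == 0x00:          # 00h: end of data
            break
        # 01h / 80h: 3 data bytes follow (assumed little-endian).
        number = int.from_bytes(data[pos:pos + 3], "little")
        entries.append((marker, number))
        pos += 3
    return entries, pos
```

Under that assumption the sample bytes give the numbers FBB3h and FBB4h, which are consecutive.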
2.0 Text items
From here on, text items follow, as many as the number of text items.
2.1 Text item 1
The first byte, 13, is the flag.
It shows that the sub-headword (0x01), usage (0x02) and text (0x10) are present.
2.1.1 Sub-headword
The first byte, 18, is the flag.
It shows that the part of speech (0x08) and the inflected form (0x10) are present.
2.1.1.1 Part of speech
F6 D9 BD D7 D8 3d C4 00
A character string. Its value is:
N-VAR
2.1.1.2 Inflected form
F6 09 20 06 26 81 8e 0F 60 F8 74 E4 80 1f 87 4E 26 82 8c 00
A character string. Its value is:
A, a<za> A's, a's<zb>
<za> and <zb> are tags that mark pronunciations.
They correspond to the 1st and 2nd entries of the audio data (1.2).
2.1.2 Usage
80 is the flag. The data consists of a number or symbol only (0x80).
2.1.2.1 Number or symbol
F9 10 00
A character string. Its value is:
1
In other words, the sense number.
2.1.3 Text
01 is the flag. The data consists of the meaning only (0x01).
2.1.3.1 Meaning
88 58 7d 82 38 09 4E 05 08 16 01 89 49 35 20 30
55 14 15 28 0F 1a 05 08 16 0F 64 38 73 09 4c 88
01 31 02 01 08 55 21 00
A character string. Its value is:
<e.A> is the first letter of the English alphabet.
<e.A> is an emphasis tag; going by the HTML, the A is displayed in bold.
2.2 Text item 2
The rest follows the same pattern as text item 1.
* Structure of graphics data file
=================================
The structure of a *.mbx file is as follows:
Header section
Unknown
Table section
Index section
Data section
- Header section
The length is 128 bytes (80h)?
All numeric values are little-endian.
Offset Length Description
0h     64     Copyright notice (the remainder is filled with 00h)
40h    16     Unknown
50h    4      Number of table entries
54h    4      Number of index entries
58h    4      Number of data records
5Ch    4      Unknown
60h    4      Table region position
64h    4      Index region position
68h    4      Data region position
6Ch    4      Unknown
70h    16     Unknown
- Table section
Not investigated.
- Index section
Records of the following form are stored consecutively, as many as the number of index entries.
Length
4 Data position (offset from the start of the data section)
- Data section
Records of the following form are stored consecutively.
Length
1 Unknown?
- Graphics data (GIF format)
The first and last records appear to be dummy data.
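
Locating the pictures could be sketched as below. The names `parse_mbx` and `looks_like_gif` are invented for illustration. The sketch assumes that the region positions are absolute file offsets, that index entries are unsigned 32-bit values, and that a real picture record holds a GIF signature right after its single unknown byte. How the length of a record is determined is not covered by these notes, so only start positions are returned.

```python
import struct


def parse_mbx(buf):
    """Return the header counts and the start position of each data record."""
    n_table, n_index, n_data = struct.unpack_from("<3I", buf, 0x50)
    table_pos, index_pos, data_pos = struct.unpack_from("<3I", buf, 0x60)
    # Index section: one 4-byte offset per entry, relative to the data section.
    rel = struct.unpack_from("<%dI" % n_index, buf, index_pos)
    return {
        "tables": n_table,
        "records": n_data,
        "table_pos": table_pos,
        "starts": [data_pos + r for r in rel],
    }


def looks_like_gif(buf, start):
    """Check for a GIF signature after the 1-byte unknown field."""
    return buf[start + 1:start + 4] == b"GIF"
```

A check such as `looks_like_gif` would also be one way to skip the dummy first and last records.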