Structure of the COBUILD dictionary files

(These notes were written while reading the script, so some parts may be wrong.)

* Files
=======

- CD-ROM editions:
      
1. Book 3rd edition
  1.1 Collins COBUILD ON CD-ROM ISBN:0-00-710884-2 ('Version 1.0 Software')
  1.2 Collins COBUILD ON CD-ROM ISBN:0-00-715905-6 ('Version 2.0 Software')
2. Book 4th edition ('Version 3.0 Software')
  2.1 Advanced Learner's English Dictionary + CD-ROM ISBN:0-00-715799-1 Hardback
  2.2 Advanced Learner's English Dictionary + CD-ROM ISBN:0-00-715800-9 Paperback
  2.3 Collins Cobuild on CD-ROM Resource Pack        ISBN:0-00-716921-3

The CD-ROM contents of 2.1 and 2.2 are thought to be identical.
2.1/2.2 contain only the dictionary itself and the Wordbank.
2.3 additionally includes the grammar, thesaurus and usage.
The dictionary file itself differs slightly between 2.1/2.2 and 2.3.

- Version 1

Installation directory:

EN       MBX       244,901  01-12-11  11:38 en.mbx      pictures
EN       SPL       524,073  01-12-11  11:38 en.spl      spellings
EN       WBX   549,351,419  01-12-11  11:40 en.wbx      pronunciations
EN-CC3   FTX    17,685,849  01-12-11  11:37 en-cc3.ftx
EN-CC3   REL       266,056  01-12-11  11:37 en-cc3.rel
EN-CC3   TRD    15,191,804  01-12-11  11:37 en-cc3.trd  dictionary
EN-CGR   FTX     1,393,929  01-12-11  11:37 en-cgr.ftx
EN-CGR   TRD     1,158,940  01-12-11  11:37 en-cgr.trd  grammar
EN-CTH   FTX     1,331,337  01-12-11  11:37 en-cth.ftx
EN-CTH   TRD     1,590,516  01-12-11  11:37 en-cth.trd  thesaurus
EN-CUS   FTX     2,473,533  01-12-11  11:37 en-cus.ftx
EN-CUS   TRD     1,645,144  01-12-11  11:37 en-cus.trd  usage
EN-CWB   FTX    37,906,681  01-12-11  11:38 en-cwb.ftx
EN-CWB   TRD    25,367,548  01-12-11  11:37 en-cwb.trd  Wordbank

- Version 2

The 'data' directory under the installation directory:
      
EN       SPL       523,273  02-09-04   4:00 en.spl          spellings
HCP-CC3  DDF         3,253  02-09-04   4:00 hcp-cc3.ddf
HCP-USR  DDF           224  02-09-04   4:00 hcp-usr.ddf
HCP_EN   WBX   549,351,419  02-09-04   4:00 hcp_en.wbx      pronunciations
HCP_EN~1 FTX    22,913,217  02-09-04   4:00 hcp_en-cc3.ftx
HCP_EN~1 MBX       244,901  02-09-04   4:00 hcp_en-cc3.mbx  pictures
HCP_EN~1 REL       334,395  02-09-04   4:00 hcp_en-cc3.rel
HCP_EN~1 TRD    15,055,464  02-09-04   4:00 hcp_en-cc3.trd  dictionary
HCP_EN~2 FTX     1,634,201  02-09-04   4:00 hcp_en-grm.ftx
HCP_EN~2 REL         5,121  02-09-04   4:00 hcp_en-grm.rel
HCP_EN~2 TRD       997,520  02-09-04   4:00 hcp_en-grm.trd  grammar
HCP_EN~3 FTX     1,576,417  02-09-04   4:00 hcp_en-gth.ftx
HCP_EN~3 TRD     1,467,456  02-09-04   4:00 hcp_en-gth.trd  thesaurus
HCP_EN~4 FTX     2,978,453  02-09-04   4:00 hcp_en-usg.ftx
HCP_EN~4 TRD     1,512,784  02-09-04   4:00 hcp_en-usg.trd  usage
HCP_EN~5 FTX    35,691,873  02-09-04   4:00 hcp_en-wbk.ftx
HCP_EN~5 TRD    23,348,888  02-09-04   4:00 hcp_en-wbk.trd  Wordbank

- Version 3

The 'data' directory under the installation directory:

    523273  May 23  2003 en.spl          spellings
      3745  May 23  2003 hcp-cc3.ddf
       224  May 23  2003 hcp-usr.ddf
  19121229  May 23  2003 hcp_en-cc3.ftx
    244901  May 23  2003 hcp_en-cc3.mbx  pictures
    314898  May 23  2003 hcp_en-cc3.rel
  13251280  May 23  2003 hcp_en-cc3.trd  dictionary
  35691873  May 23  2003 hcp_en-wbk.ftx
  23348888  May 23  2003 hcp_en-wbk.trd  Wordbank
 549351419  May 23  2003 hcp_en.wbx      pronunciations
     20150  May 23  2003 lexsmb.mbx

The Resource Pack also contains the grammar, thesaurus and usage files.

*.trd  Text data
*.ftx  Full-text search index
*.rel  Relation file
*.wbx  Pronunciation data
*.mbx  Graphics data
*.spl  Spelling data
*.ddf  Dictionary description file
     
* Structure of dictionary file
==============================

A *.trd file is structured as follows:

 Header section
    Unknown (not investigated)
 Index section
 Data section
     
- Header section

The length is fixed at 128 bytes (80h).
All numeric values are little-endian.

Offset  Length
  0h  64  Copyright notice (the remainder is filled with 00h)
 40h  16  Unknown
 50h   4  Number of data records
 54h   4  Unknown
 58h   4  Number of index bases
 5Ch   4  Number of index offsets (number of data records + 1)
 60h   4  Unknown
 64h   4  Position of the index section
 68h   4  Position of the data section
 6Ch   4  Unknown
 70h  16  Unknown
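
The header table can be read with a few lines of Python. This is only a sketch: the function name parse_trd_header and the dictionary keys are made up for illustration, and decoding the copyright notice as Latin-1 is an assumption. The unknown fields are skipped.

```python
import struct

TRD_HEADER_SIZE = 0x80  # fixed header length given in the table above


def parse_trd_header(buf: bytes) -> dict:
    """Decode the documented fields of a *.trd header (hypothetical helper)."""
    if len(buf) < TRD_HEADER_SIZE:
        raise ValueError("a *.trd header is 128 bytes long")
    # 50h..6Bh: little-endian 32-bit fields; "4x" skips the unknown ones.
    n_records, n_bases, n_offsets, index_pos, data_pos = struct.unpack_from(
        "<I4xII4xII", buf, 0x50)
    return {
        # The text encoding of the notice is an assumption, not documented.
        "copyright": bytes(buf[:0x40]).rstrip(b"\x00").decode("latin-1"),
        "records": n_records,
        "index_bases": n_bases,
        "index_offsets": n_offsets,  # expected to be records + 1
        "index_pos": index_pos,
        "data_pos": data_pos,
    }
```

For a real file, the first 128 bytes read in binary mode would be passed in, for example the result of f.read(0x80).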
     
- Index section 
      
It is stored in the following form:

  base 1
  base 2
   ...
  base n
  ofs 1..64  (for base 1)
  ofs 1..64  (for base 2)
   ...
  ofs 1..x   (for base n)

First, the index bases are stored, as many as the number of index bases.

Length
  4  Index base (offset from the start of the data section)

Next, the index offsets are stored in order, 64 for each index base.
The last index base may have fewer than 64 index offsets.

Length
  2  Index offset (offset from the index base, in units of 4 bytes)


The position of a data record is calculated with the following formula.

Data record position = data section position
     + index base
     + (index offset * 4)

The length of each data record extends to the position of the next data record.
The number of index offsets is the number of data records + 1;
in other words, the last index offset exists to give the length of the last record.
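
The formula can be sketched in Python as follows. The function name record_spans and its parameters are hypothetical. The sketch assumes that the index offsets are unsigned 16-bit values and that index offset i belongs to index base i // 64, which is how the layout above reads. The counts and the data section position would come from the header.

```python
import struct


def record_spans(index: bytes, n_bases: int, n_offsets: int, data_pos: int) -> list:
    """Return (position, length) for every data record (hypothetical helper).

    `index` is the raw index section: n_bases 4-byte bases followed by
    n_offsets 2-byte offsets, all little-endian.
    """
    bases = struct.unpack_from(f"<{n_bases}I", index, 0)
    # Assumption: offsets are unsigned 16-bit values.
    offsets = struct.unpack_from(f"<{n_offsets}H", index, 4 * n_bases)
    # Position = data section position + index base + index offset * 4
    positions = [data_pos + bases[i // 64] + offsets[i] * 4
                 for i in range(n_offsets)]
    # The extra final offset only serves to delimit the last record.
    return [(positions[i], positions[i + 1] - positions[i])
            for i in range(n_offsets - 1)]
```

Because each length is taken as the distance to the next position, this also works across the boundary between two index bases.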


- Data section

Records of the following form are stored consecutively.

Length
  1  Number of text items
  -  Headword item
  -  Text item 1
      ...
  -  Text item N

Each item consists of a bit flag (1 byte) showing which data elements are stored,
followed by the actual data.
Character data is stored as a bitstream of 6-bit units.
Below, unless otherwise noted, the data is character data.


== Structure of the headword item

Flag
  0x01 Identification part
      0x01 Headword
      0x02 suspicious look / key? (thesaurus and grammar only)
      0x08 Part of speech (thesaurus only)
      0x10 Inflected forms [ CD-ROM v3.0 ]
      0x20 Usage (thesaurus only)
      0x80 Pronunciation

  0x02 Additional data
      (What is the reason this is classified separately from 0x01 in the headword item?)
      0x02 Syntax pattern [ Resource Pack ]
      0x04 Hyphenation [ CD-ROM v3.0 ]
      0x08 Notes
      0x10 Chapter/section number (grammar only)
      0x20 Frequency
           The data is a 1-byte number giving the frequency level
           (1..5; 1..3 on CD-ROM v3.0)

  0x80 Audio data
      Data of the following form are stored consecutively.
      01h:  Data follows (3 bytes of data follow; the first data?)
      80h:  Data follows (3 bytes of data follow; data other than the first?)
      00h:  End of data
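
The audio data list (01h / 80h markers, 3 bytes each, terminated by 00h) could be walked as in the sketch below. The name parse_audio_list is hypothetical, and reading the 3 bytes as a little-endian number is an assumption; it is suggested only by the header being little-endian and by the sample bytes B3 FB 00 and B4 FB 00 in the data example of this document, which then come out as consecutive numbers.

```python
def parse_audio_list(buf: bytes, pos: int = 0):
    """Parse marker + 3-byte entries up to the 00h terminator.

    Returns (entries, next_pos), where entries is a list of
    (marker, number) tuples. Hypothetical helper.
    """
    entries = []
    while True:
        marker = buf[pos]
        pos += 1
        if marker == 0x00:          # end of data
            return entries, pos
        if marker not in (0x01, 0x80):
            raise ValueError(f"unexpected marker {marker:#04x}")
        payload = buf[pos:pos + 3]
        if len(payload) < 3:
            raise ValueError("truncated audio entry")
        # Assumption: the 3 bytes form a little-endian audio data number.
        entries.append((marker, int.from_bytes(payload, "little")))
        pos += 3
```

The returned next_pos points at the byte after the terminator, which is where the first text item would begin in a record.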
     
== Structure of a text item

Flag
  0x01 Subheading
      0x01 Heading
      0x02? Alternative heading [ CD-ROM v3.0 ] (only one occurrence, in "faff"; a compilation mistake?)
      0x08 Part of speech
      0x10 Derived form
      0x20 Explanation of the whole entry [ Resource Pack ]

  0x02 Usage etc.
      0x02 suspicious look
      0x04 Hyphenation [ CD-ROM v3.0 ]
      0x08 Notes (not displayed in the thesaurus only)
      0x10 Usage (in the grammar, the chapter/section number)
      0x20? (dictionary only)
           The data is 1 byte with a value of 01h..05h (frequency?)
      0x80 Number or symbol

  0x10 Text
      0x01 Meaning
      0x02 Examples
           The first byte of the data is the number of examples. The example data follow.
      0x04 Table
           The first byte of the data is the number of rows. The data of each row follow.
      0x08 References
           The first byte of the data is the number of references. Each reference follows.
      0x40 References (See also) [ CD-ROM v3.0 ]
      0x80 Picture
           The first byte of the data is a flag. The next 4 bytes are the picture number.

  0x40 Related words
      0x01 Synonyms
      0x02 Antonyms
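
As a small illustration of how the top-level flag bytes combine, the sketch below maps a flag byte to the elements it announces. The table names, the English labels and the function decode_flags are hypothetical paraphrases of the lists above, not names taken from the script.

```python
# Top-level flag bits of the two item types, paraphrased from the lists above.
HEADWORD_ITEM_FLAGS = {
    0x01: "identification part",
    0x02: "additional data",
    0x80: "audio data",
}
TEXT_ITEM_FLAGS = {
    0x01: "subheading",
    0x02: "usage etc.",
    0x10: "text",
    0x40: "related words",
}


def decode_flags(flag: int, table: dict) -> list:
    """List the elements announced by a flag byte (hypothetical helper)."""
    names = [name for bit, name in table.items() if flag & bit]
    unknown = flag & ~sum(table)  # bits not covered by the table
    if unknown:
        names.append(f"unknown 0x{unknown:02X}")
    return names
```

With the bytes from the data example in this document, 0x81 for the headword item yields the identification part and the audio data, and 0x13 for the first text item yields the subheading, usage etc. and text.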
     
     
* Character data
================

Character data is a bitstream of 6-bit units, stored in whole bytes.
An ordinary character takes 6 bits, a less common character 12 bits,
and an external character 30 bits.
The end of character data is marked by 000000b.

Value of the first 6 bits
   0       End of string
   1..26   Lowercase letters (a..z)
  32..38   Common symbols
  40..62   Combined with the following unit to form one character
           (uppercase letters, digits, symbols, accented letters, etc.)
  63       External character (each of the next 4 units, minus 1,
           gives one 4-bit group of the Unicode code point)

See CobuidLib.rb for details.
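
Splitting a byte string into 6-bit units can be sketched as below. The names sextets and classify are hypothetical, and the sketch stops at the first 000000b unit, which assumes that a zero unit never occurs inside a 12-bit or 30-bit character. It only classifies units according to the table above; the full character mapping is in CobuidLib.rb.

```python
def sextets(data: bytes) -> list:
    """Split data into 6-bit units, most significant bit first,
    stopping at the 000000b terminator (hypothetical helper)."""
    units = []
    acc = 0    # bit accumulator
    nbits = 0  # number of unread bits held in acc
    for byte in data:
        acc = (acc << 8) | byte
        nbits += 8
        while nbits >= 6:
            nbits -= 6
            unit = (acc >> nbits) & 0x3F
            if unit == 0:        # end of character data
                return units
            units.append(unit)
        acc &= (1 << nbits) - 1  # keep only the unread bits
    return units


def classify(unit: int) -> str:
    """Name the category of a leading 6-bit unit per the table above."""
    if 1 <= unit <= 26:
        return "lowercase " + chr(ord("a") + unit - 1)
    if 32 <= unit <= 38:
        return "symbol"
    if 40 <= unit <= 62:
        return "prefix of a 12-bit character"
    if unit == 63:
        return "external character"
    return "undocumented"
```

Applied to the headword bytes F6 09 20 04 00 from the data example in this document, sextets returns the units 61, 32, 36, 32, 1, matching the 6-bit groups shown there.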


Tags are used within character data.

- <tag.string> form (examples in parentheses)
    B   bold [ CD-ROM v3.0 ]
    E   emphasis
    I   italic
    U   underline? (is it used?)
    C   note (comment) (informal, Brit, ...)
    F   phrase?
    X   cross reference
    G   grammatical element? (of, in, see also, ...)
    O   option? pronoun? (your, someone, ...)
    V   ? (see, one's, ...) [ CD-ROM v3.0 ]
    W   ? (WEAK, STRONG, ...) [ CD-ROM v3.0 ]
    A   American
    B   British
    O   option? (to, with, someone, ...)
    E   exponent? (superscript number)
    S   symbol [ CD-ROM v3.0 ]
- <tag> form
    H   hyphenation point [ CD-ROM v3.0 ]
    DW  warning
    Li  table item (list item)
    Lb  table start (list begin)
    Le  table end (list end)
    Z[1-9]?[a-z]  pronunciation icon
    A[a-z]        American pronunciation icon [ CD-ROM v3.0 ]
    B[a-z]        British pronunciation icon [ CD-ROM v3.0 ]

Format within a table:

  • ...
  • ...

* Example of data
=================

The entry for A is explained here as an example.
The data is as follows.

  05 81 01 F6 09 20 04 00 01 B3 FB 00 80 B4 FB 00
  00 13 18 F6 D9 BD D7 D8 3D C4 00 F6 09 20 06 26
  81 8E 0F 60 F8 74 E4 80 1F 87 4E 26 82 8C 00 80
  F9 10 00 01 88 58 7D 82 38 09 4E 05 08 16 01 89
  49 35 20 30 55 14 15 28 0F 1A 05 08 16 0F 64 38
  73 09 4C 88 01 31 02 01 08 55 21 00 13 18 F6 D9 ...

0. Number of text items

  05
  The first byte is the number of text items; in other words, there are 5 items.

1. Headword item

  81
  The next byte is the flag of the headword item.
  It shows that the identification part (0x01) and the audio data (0x80) are present.

1.1 Identification part

  01
  The identification part begins with a flag.
  The data is the headword only (0x01).

1.1.1 Headword

  F6 09 20 04 00

  In binary this is:

  11110110 00001001 00100000 00000100 00000000

  It is parsed in 6-bit units; the null (000000b) marks the end of the data.

  111101 100000 100100 100000 000001 000000 0000

  The value is "A, a".

1.2 Audio data

  01 B3 FB 00 80 B4 FB 00 00

  This can be parsed as follows.

  01        Data follows (the first data?)
  B3 FB 00  Audio data number (?)
  80        Data follows (a later data?)
  B4 FB 00  Audio data number (?)
  00        End of data

2.0 Text items

  From here on, text items follow, as many as the number of text items.

2.1 Text item 1

  13
  The first byte is a flag.
  It shows that the subheading (0x01), usage etc. (0x02) and text (0x10) are present.

2.1.1 Subheading

  18
  The first byte is a flag.
  It shows that the part of speech (0x08) and inflected forms (0x10) are present.

2.1.1.1 Part of speech

  F6 D9 BD D7 D8 3D C4 00
  Character string. The value is "N-VAR".

2.1.1.2 Inflected forms

  F6 09 20 06 26 81 8E 0F 60 F8 74 E4 80 1F 87 4E 26 82 8C 00
  Character string. The value is "A, a" and "A's, a's", each followed by a tag.
  The tags indicate pronunciation icons. They correspond to the first and the
  second data of the audio data in the identification part.

2.1.2 Usage etc.

  80
  Flag. The data is the number or symbol only (0x80).

2.1.2.1 Number or symbol

  F9 10 00
  Character string. The value is "1"; in other words, the sense number.

2.1.3 Text

  01
  Flag. The data is the meaning only (0x01).

2.1.3.1 Meaning

  88 58 7D 82 38 09 4E 05 08 16 01 89 49 35 20 30 55 14 15 28
  0F 1A 05 08 16 0F 64 38 73 09 4C 88 01 31 02 01 08 55 21 00
  Character string. The value is "A is the first letter of the English alphabet.",
  where the A is enclosed in an emphasis tag.
  When converted to HTML, the A is displayed in bold.

2.2 Text item 2

  From here on, the same as text item 1.

* Structure of graphics data file
=================================

A *.mbx file is structured as follows.

 Header section
    Unknown
 Table section
 Index section
 Data section

- Header section

The length is 128 bytes (80h)?
All numeric values are little-endian.

Offset  Length
  0h  64  Copyright notice (the remainder is filled with 00h)
 40h  16  Unknown
 50h   4  Number of table entries
 54h   4  Number of index entries
 58h   4  Number of data records
 5Ch   4  Unknown
 60h   4  Position of the table section
 64h   4  Position of the index section
 68h   4  Position of the data section
 6Ch   4  Unknown
 70h  16  Unknown

- Table section

Not investigated.

- Index section

Records of the following form are stored consecutively, as many as the number of index entries.

Length
  4  Data position (offset from the start of the data section)

- Data section

Records of the following form are stored consecutively.

Length
  1  Unknown?
  -  Graphics data (GIF format)

The first record and the last record appear to be dummy data.
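
A minimal Python sketch of locating the graphics records, assuming the whole *.mbx file is held in memory. The function names are hypothetical. How the length of each graphic is determined is not described above, so the sketch only returns the start positions, plus a check that a record looks like one unknown byte followed by GIF data.

```python
import struct


def mbx_data_positions(buf: bytes) -> list:
    """Absolute start positions of the records listed in the index section
    of a *.mbx file image (hypothetical helper)."""
    (n_index,) = struct.unpack_from("<I", buf, 0x54)            # number of index entries
    index_pos, data_pos = struct.unpack_from("<II", buf, 0x64)  # section positions
    offsets = struct.unpack_from(f"<{n_index}I", buf, index_pos)
    # Each index entry is an offset from the start of the data section.
    return [data_pos + off for off in offsets]


def looks_like_gif(buf: bytes, pos: int) -> bool:
    """True if the record at pos is one unknown byte followed by a GIF signature."""
    return buf[pos + 1:pos + 4] == b"GIF"
```

Since the first and last records appear to be dummy data, a caller would probably want to skip the records for which looks_like_gif returns False.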