The builder requires a Java™ Runtime Environment (JRE) version 1.4+.
A powerful machine is recommended. The compression algorithm used by the builder is very memory intensive. You should have at least 128 MB of memory (or of swap space, at a pinch) on your machine. The dictbuilder script sets the memory limit to a high value. In exceptional cases, you might have to set it higher. As an indication, compiling our big French word list (800,000 words, 9.8 MB) requires 30 seconds and 87 MB of memory on a 1 GHz Pentium III.
The builder is no longer included in the SDK. It must be downloaded separately from www.xmlmind.com/xmleditor/dictbuilder.shtml.
In all cases, the builder is a command-line utility: a shell file named dictbuilder on Unix or MacOS, dictbuilder.bat on Windows.
General form of the command line:
dictbuilder ?options? word_list ... word_list ?-sub word_list ... word_list?
It is also possible to use a compiled dictionary as input. This is the way to create a new version of an existing dictionary if you do not possess the source word list.
General options:
-cs character_encoding
Encoding used in word lists, frequent word list and hints files. This must be an encoding supported by the Java™ runtime.
This option must be placed before the files it applies to.
-hints hints_file
Specifies the hints file.
Specifying a hints file is almost always needed as this file is used to specify which characters may be used to form a word.
The hints files used to build XMLmind's en, fr, de, and es dictionaries are found here: en.hints, fr.hints, de.hints, es.hints. Note that the encoding of all these hints files is ISO-8859-1.
-freq word_list
List of frequent words.
-prefixes word_list
List of standard prefixes.
-sub word_list ... word_list
Every word list whose path follows this option is subtracted from the resulting dictionary instead of being merged into it. That is, every word belonging to such a word list will be absent from the result. This option should be placed after the input word lists.
-o output_file
Specifies the compiled dictionary output file. The convention is to use a .cdi extension, but this is not required.
Other options:
-verbose
Explain what is being done.
-dump out_word_list
After merging all the compiled and textual word lists specified in the command line, and after subtracting words if the -sub option is used, output the resulting word list to the specified text file. As always, the encoding of the generated text file is specified using the -cs option.
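To picture what the merge and -sub steps amount to, the following sketch reproduces the same set operations on plain-text word lists with standard Unix tools. It is only an illustration of the semantics described above, not a description of how the builder works internally. The helper names (merge_words, subtract_words) and the sample words are hypothetical; the file names are borrowed from the examples, and one word per line is assumed.

```shell
#!/bin/sh
# Illustration only: the set operations described in the text, applied to
# plain-text word lists (assumption: one word per line).
# This does not read or write compiled .cdi dictionaries.

# merge_words FILE...: union of all listed word lists, duplicates removed.
merge_words() {
    LC_ALL=C sort -u "$@"
}

# subtract_words MERGED REMOVED: words of MERGED that are absent from REMOVED.
# Both inputs must already be sorted in the C locale, as comm requires.
subtract_words() {
    LC_ALL=C comm -23 "$1" "$2"
}

# Demo in a scratch directory with made-up sample words.
tmp=$(mktemp -d)
printf '%s\n' apple pear plum > "$tmp/mywords.txt"
printf '%s\n' plum quince     > "$tmp/extrawords.txt"
printf '%s\n' pear            > "$tmp/removed_words.txt"

merge_words "$tmp/mywords.txt" "$tmp/extrawords.txt"   > "$tmp/merged.txt"
merge_words "$tmp/removed_words.txt"                   > "$tmp/removed.sorted"
subtract_words "$tmp/merged.txt" "$tmp/removed.sorted" > "$tmp/result.txt"

# Prints apple, plum and quince, one per line: "plum" appears once although
# it is in both inputs, and "pear" has been subtracted.
cat "$tmp/result.txt"
rm -rf "$tmp"
```

With -dump, the builder writes the equivalent of result.txt for the word lists actually given on its command line.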
Example 1: Create compiled dictionary mylang.cdi out of word lists mywords.txt and extrawords.txt. The encoding of all text files specified in the command line is ISO-8859-2. Hints file is mylang.hints. Frequent words are contained in frqw.txt. Standard prefixes are contained in myprefixes.txt.
dictbuilder -cs ISO-8859-2 -hints mylang.hints -freq frqw.txt -prefixes myprefixes.txt \
    mywords.txt extrawords.txt -o mylang.cdi
Example 2: Add words contained in added_words.txt to compiled dictionary de.cdi. Compile the resulting word list as new_de.cdi.
dictbuilder -cs ISO-8859-1 -hints de.hints de.cdi added_words.txt -o new_de.cdi
Example 3: Subtract words contained in removed_words.txt from compiled dictionary de.cdi. Compile the resulting word list as new_de.cdi.
dictbuilder -cs ISO-8859-1 -hints de.hints de.cdi \
    -sub removed_words.txt -o new_de.cdi
Example 4: Output in text file de.txt all the words contained in compiled dictionary de.cdi.
dictbuilder -verbose -cs ISO-8859-1 -hints de.hints de.cdi -dump de.txt
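Examples 2 and 3 can plausibly be combined into a single invocation, since the general form of the command line allows input word lists followed by a -sub list. The sketch below wraps that in a small shell function. The helper names (run, update_dictionary) and the DRY_RUN variable are hypothetical conveniences, not part of the builder, and combining both operations in one call is an assumption based on the general form rather than something the examples above demonstrate.

```shell
#!/bin/sh
# Sketch only: add and remove words from an existing compiled dictionary
# in one dictbuilder call. Only options shown in the examples are used.

# run CMD ARG...: hypothetical helper. With DRY_RUN=1 it prints the
# command line instead of executing it, which is handy for checking the
# argument order before touching a real dictionary.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# update_dictionary HINTS OLD_CDI ADDED REMOVED NEW_CDI
# Assumes all text files are encoded in ISO-8859-1, as in Examples 2 and 3.
update_dictionary() {
    run dictbuilder -cs ISO-8859-1 -hints "$1" "$2" "$3" -sub "$4" -o "$5"
}
```

For instance, after setting DRY_RUN=1, calling update_dictionary de.hints de.cdi added_words.txt removed_words.txt new_de.cdi prints the command line without running the builder. Note that -cs comes first, because it must precede the files it applies to, and that the -sub list comes after the input word lists.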