Start with a lexicon file: a plain text file listing one word/phrase on each line. I have used the UKACD18 file (after editing it a bit) for English, and a few other sources for other languages. You can find several options for English at the Qxw site.
Say this file is called words.txt.
Just use the command make to build all the binaries.
- Grab all of English Wikipedia (this part is essentially taken from the steps outlined in the Nutrimatic project).
- Download the latest Wikipedia database dump (this is an ~18GB file!):
wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
- Extract the text from the articles using Wikipedia Extractor (this generates ~16GB of text, and can take several hours!):
wget https://raw.githubusercontent.com/apertium/WikiExtractor/master/WikiExtractor.py
python3 WikiExtractor.py --infn enwiki-latest-pages-articles.xml.bz2
This will write a giant file named wiki.txt. You may kill the extractor process once wc -l wiki.txt crosses 30,000,000 (as the next add-wiki-popularity step reads at most 30M lines).
- Run add-wiki-popularity. This might take a couple of hours.
cat wiki.txt | ./add-wiki-popularity English words.txt > importance-and-words.tsv
The created file importance-and-words.tsv is a copy of words.txt with a numeric occurrence count prefixed to each line, with a tab character as the separator.
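As a rough illustration of that line format, here is a minimal JavaScript sketch that parses text shaped like the output of add-wiki-popularity. The function name parseImportanceLines and the sample counts are made up for this example and are not part of the project.

```javascript
// Minimal sketch: parse lines of the form "<count>\t<word or phrase>".
// parseImportanceLines is a hypothetical helper, not part of this repo.
function parseImportanceLines(text) {
  const entries = [];
  for (const line of text.split('\n')) {
    if (!line) continue;                 // skip blank lines
    const tab = line.indexOf('\t');
    if (tab < 0) continue;               // skip lines without a tab separator
    const count = Number(line.slice(0, tab));
    if (!Number.isFinite(count)) continue;
    // Everything after the first tab is the word/phrase, unchanged.
    entries.push({ word: line.slice(tab + 1), count });
  }
  return entries;
}

// Made-up sample data, only to show the shape of the file.
const sampleTsv = "1523\tcat\n7\tcat o' nine tails\n";
const parsedSample = parseImportanceLines(sampleTsv);
console.log(parsedSample);
```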
- Run index-word-list on the importance-and-words.tsv file.
- The output will be a file containing JavaScript code that creates an object called exetLexicon that has an array called lexicon of all the words (with an empty string at index 0), an array called importance containing all the importance scores, an object called index that maps various indexing keys to arrays of word indices, and an array called anagrams that is a sharded index for searching for anagrams. It also has an array phones and a sharded index phindex, for pronunciations.
- The needed parameter is the name of a file that contains pronunciations in a simple TSV format (word\tpronunciation). The pronunciation can be in ARPAbet or IPA format.
- For English, you can derive it from CMUdict (please follow its license instructions).
- If you don't have pronunciations available, just create an empty file.
- The crossed_words.txt file can contain a list of words to avoid (such as profanities or offensive words). You can pass an empty file if you do not have/want such a list.
./index-word-list English importance-and-words.tsv words_and_phones.tsv crossed_words.txt > lufz-en-lexicon.js
For English, the generated lufz-en-lexicon.js file will have the lines:
/**
* --- Paste contents of lufz-en-lexicon-stems-patch.js below. ---
* --- Generate it using lufz-en-lexicon-get-stems-patch.html ---
*/
Copy over the generated file into the stemming/ folder. Make sure that the stemming/ folder also contains a copy of wink-porter2-stemmer.js, and then open the HTML file in the stemming/ folder, named lufz-en-lexicon-get-stems-patch.html, in a web browser. This will save a file named lufz-en-lexicon-stems-patch.js to the browser's Downloads folder. Copy and paste the contents of this downloaded file at the location identified by the above comment, into lufz-en-lexicon.js.
The exetLexicon.index object has keys that look like 'AB???'. When you want to look for a phrase with only some letters known, replace all unknown letters by '?', get rid of all spaces, uppercase the string and then look in index. If not found, iteratively replace the last known character with '?' and look up again. When you get a hit, go through the returned array of lexicon indices and keep only the entries that match the original, unmodified key.
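The lookup described above can be sketched roughly as follows. This is only an illustration on a tiny hand-made object: toyLexicon, its index keys, and the helpers normalizePattern, matchesPattern and lookupPattern are all hypothetical, and the sketch assumes that lexicon entries are compared after uppercasing and dropping everything except A-Z (a reasonable guess for English, but not confirmed here).

```javascript
// Hypothetical toy data shaped like exetLexicon (real keys may differ).
const toyLexicon = {
  lexicon: ['', 'able', 'abbey', 'abide', 'about'],
  index: { 'AB???': [2, 3, 4], 'AB??': [1] },
};

// Uppercase and keep only letters and '?' (this also drops spaces).
function normalizePattern(s) {
  return s.toUpperCase().replace(/[^A-Z?]/g, '');
}

// True if the entry has the pattern's length and agrees on all known letters.
function matchesPattern(entry, pattern) {
  const letters = normalizePattern(entry);
  if (letters.length !== pattern.length) return false;
  for (let i = 0; i < pattern.length; i++) {
    if (pattern[i] !== '?' && pattern[i] !== letters[i]) return false;
  }
  return true;
}

function lookupPattern(lex, rawPattern) {
  const pattern = normalizePattern(rawPattern);
  let key = pattern;
  while (!Object.prototype.hasOwnProperty.call(lex.index, key)) {
    // Position of the last known (non-'?') character in the key.
    const lastKnown = key.search(/[^?](?=\?*$)/);
    if (lastKnown < 0) return [];        // nothing left to generalize
    key = key.slice(0, lastKnown) + '?' + key.slice(lastKnown + 1);
  }
  // The hit may be broader than the query, so filter against the original.
  return lex.index[key]
    .filter((i) => matchesPattern(lex.lexicon[i], pattern))
    .map((i) => lex.lexicon[i]);
}

console.log(lookupPattern(toyLexicon, 'ab?d?'));  // falls back to 'AB???'
```

Here the query 'ab?d?' has no key of its own in the toy index, so the sketch generalizes it to 'AB???' and then filters the three candidates down to the ones with a D in the fourth position.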
The exetLexicon.anagrams array is of length 2000. Each entry is an array of lexicon indices. To find anagrams of a string, uppercase it, remove all unknown characters and spaces, sort it (this is the "key"), take the JavaHash() of the key modulo 2000 (adding 2000 if negative), to find the shard index. Go through all entries in the shard (~100) and filter out those that do not have the exact same key.
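A minimal sketch of that anagram lookup is shown below. It assumes that JavaHash() behaves like Java's String.hashCode() (32-bit, multiplier 31), which is a guess based on the name, and that keys keep only the letters A-Z. The toy lexicon and the helpers javaHashSketch, anagramKey, shardOf, buildToyAnagrams and findAnagrams are hypothetical and exist only for this example.

```javascript
const NUM_SHARDS = 2000;

// Assumed equivalent of Java's String.hashCode(): h = 31*h + c in 32-bit ints.
function javaHashSketch(s) {
  let h = 0;
  for (let i = 0; i < s.length; i++) {
    h = (Math.imul(31, h) + s.charCodeAt(i)) | 0;
  }
  return h;
}

// Uppercase, keep only A-Z, and sort the letters to get the anagram key.
function anagramKey(s) {
  return s.toUpperCase().replace(/[^A-Z]/g, '').split('').sort().join('');
}

// Hash modulo the shard count, shifted into [0, NUM_SHARDS) if negative.
function shardOf(key) {
  let shard = javaHashSketch(key) % NUM_SHARDS;
  if (shard < 0) shard += NUM_SHARDS;
  return shard;
}

// Build a toy sharded index of lexicon indices (index 0 is the empty string).
function buildToyAnagrams(lexicon) {
  const shards = Array.from({ length: NUM_SHARDS }, () => []);
  for (let i = 1; i < lexicon.length; i++) {
    shards[shardOf(anagramKey(lexicon[i]))].push(i);
  }
  return shards;
}

function findAnagrams(lex, s) {
  const key = anagramKey(s);
  // Different keys can share a shard, so keep only exact key matches.
  return lex.anagrams[shardOf(key)]
    .filter((i) => anagramKey(lex.lexicon[i]) === key)
    .map((i) => lex.lexicon[i]);
}

const toyWords = ['', 'listen', 'silent', 'enlist', 'tinsel', 'google'];
const toyAnagramLexicon = { lexicon: toyWords, anagrams: buildToyAnagrams(toyWords) };
console.log(findAnagrams(toyAnagramLexicon, 'inlets'));
```

Filtering by the exact key after picking the shard is what makes collisions harmless: the shard only narrows the search to a small candidate list.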
The exetLexicon.phindex array is just like the anagrams array, but is an index of the pronunciations.
I wrote this code for use in the Exet project, which is a web app for crossword construction.