capricorn is a lightweight library that helps you build a vocabulary from a corpus and prepare word embeddings ready to be used by learning models.
- build a vocabulary from a corpus
- load only the word embeddings you need, with indices consistent with the vocabulary's word-to-index mapping
```
pip install capricorn
```
```python
import os

import capricorn

# Specify file paths
vocab_path = "vocab_processor"
embedding_vector_path = "path/to/embedding"

# Load the vocabulary if it was saved before, otherwise build it
if os.path.isfile(vocab_path):
    print("Loading vocabulary ...")
    vocab_processor = capricorn.VocabularyProcessor.restore(vocab_path)
else:
    print("Building vocabulary ...")
    x_text = ["Saudi Arabia Equity Movers: Almarai, Jarir Marketing and Spimaco.",
              "Orange, Thales to Get French Cloud Computing Funds, Figaro Says.",
              "Stansted Could Double Passengers on Deregulation, Times Reports."]
    max_document_length = 11
    min_freq_filter = 2
    vocab_processor = capricorn.VocabularyProcessor(max_document_length=max_document_length,
                                                    min_frequency=min_freq_filter)
    # Only fit:
    # vocab_processor.fit(x_text)
    # or fit_transform to also get the transformed corpus:
    x_text_transformed = vocab_processor.fit_transform(x_text)
    vocab_processor.save(vocab_path)
    print("vocab_processor saved at:", vocab_path)

# Build an embedding matrix whose row indices are consistent with the
# vocabulary's word-to-index mapping
embedding_matrix = vocab_processor.prepare_embedding_matrix_with_dim(embedding_vector_path, 300)
```
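The returned `embedding_matrix` can then be handed to a learning framework. Below is a minimal sketch using PyTorch as an example framework; it assumes the matrix is a NumPy array of shape `(vocab_size, 300)` with row `i` holding the vector for word index `i` (these assumptions are not confirmed by capricorn itself).

```python
import numpy as np
import torch
import torch.nn as nn

# Stand-in for the real matrix returned by prepare_embedding_matrix_with_dim;
# assumption: shape (vocab_size, embedding_dim), indexed by vocabulary word index
embedding_matrix = np.random.rand(5000, 300)

# Wrap the matrix in an embedding layer; freeze=False keeps it trainable
embedding_layer = nn.Embedding.from_pretrained(
    torch.tensor(embedding_matrix, dtype=torch.float32), freeze=False)

# Rows of the transformed corpus are sequences of word indices,
# so they can be looked up directly
token_ids = torch.tensor([[4, 12, 7, 0, 0]])  # hypothetical transformed row
vectors = embedding_layer(token_ids)          # shape: (1, 5, 300)
```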
The library defaults to using the special tokens `__UNK__` and `__PAD__`. If an input sequence is shorter than the `max_document_length` set when initializing the `VocabularyProcessor`, it is automatically padded with `__PAD__`.
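For instance, with `max_document_length=11` a five-word sentence comes back padded to eleven indices. A minimal sketch using only the calls shown above (assuming `min_frequency` can be left at its default; the exact indices and the index assigned to `__PAD__` are library-defined):

```python
import capricorn

vocab_processor = capricorn.VocabularyProcessor(max_document_length=11)

# Five words in, eleven indices out: the remaining positions are
# filled with the index of __PAD__
transformed = vocab_processor.fit_transform(["We like it very much"])
print(transformed[0])  # e.g. [2 3 4 5 6 0 0 0 0 0 0] -- exact values may differ
```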
If the user has defined additional special tokens when initializing the vocabulary, the sequences must be pre-processed accordingly, i.e., the self-defined special tokens have to be added to the input sequences by the user. For example, if `__START__` and `__END__` were defined as additional special tokens and `max_document_length=11`, the original sentence has to be processed from:

```
"We like it very much"
```

to:

```
"__START__ __PAD__ __PAD__ We like it very much __PAD__ __PAD__ __END__"
```