Preprocessing

First we preprocess the example data, a tiny corpus of 9 documents, reproducing the gensim tutorial on corpora and vector spaces.

library(gensimr)

set.seed(42) # reproducibility

# sample data
data(corpus, package = "gensimr")

# preprocess corpus
docs <- prepare_documents(corpus)
#> → Preprocessing 9 documents
#> ← 9 documents after preprocessing
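
Each element of docs is one tokenised document. Assuming docs converts to a plain R list of token vectors (reticulate converts Python lists by default; the indexing below is an illustrative sketch), we can peek at the first one.

# sanity check on the structure of the preprocessed documents
docs[[1]]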

Model

Word2vec works somewhat differently from the previous models: it learns a dense vector for each word directly from the tokenised documents. The example below is a reproduction of the Kaggle Gensim Word2Vec Tutorial.

# initialise a 100-dimensional model with a context window of 5
word2vec <- model_word2vec(size = 100L, window = 5L, min_count = 1L)
word2vec$build_vocab(docs) # build the vocabulary from the documents
#> None
# train; returns the number of effective words trained and of raw words processed
word2vec$train(docs, total_examples = word2vec$corpus_count, epochs = 20L)
#> (76, 580)
# L2-normalise the vectors in place; saves memory but ends training
word2vec$init_sims(replace = TRUE)
#> None

Now we can explore the model.

word2vec$wv$most_similar(positive = c("interface"))
#> [('computer', 0.23181433975696564), ('graph', 0.11893773078918457), ('minors', 0.09199836105108261), ('eps', 0.06503799557685852), ('user', 0.04753843694925308), ('time', -0.008810970932245255), ('system', -0.011411845684051514), ('response', -0.01997048407793045), ('human', -0.029993511736392975), ('survey', -0.052159011363983154)]
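
Beyond similarity queries, the underlying KeyedVectors object exposes the raw embeddings. Assuming the wrapper passes gensim's word_vec() method through unchanged (an assumption, as it is not shown in this tutorial), we can retrieve the 100-dimensional vector learned for a word.

# sketch: retrieve the raw embedding for "interface" via gensim's word_vec();
# whether the result converts to an R numeric vector depends on reticulate
vec <- word2vec$wv$word_vec("interface")
length(vec) # 100, the `size` chosen when initialising the model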

We expect “trees” to be the odd one out: it is a term from a different topic (#2), whereas the other terms all come from topic #1.

word2vec$wv$doesnt_match(c("human", "interface", "trees"))
#> interface

We can also test the similarity between pairs of words.

word2vec$wv$similarity("human", "trees")
#> 0.024661217
word2vec$wv$similarity("eps", "survey")
#> -0.10218239
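
These scores are plain cosine similarities. Since init_sims(replace = TRUE) L2-normalised the vectors above, the cosine reduces to a dot product, which we can verify by hand (again assuming word_vec() is passed through and converts to an R vector).

# sketch: after L2-normalisation, cosine similarity is just a dot product
v1 <- word2vec$wv$word_vec("human")
v2 <- word2vec$wv$word_vec("trees")
sum(v1 * v2) # should match word2vec$wv$similarity("human", "trees")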

Phrases

Automatically detect common phrases – multi-word expressions / word n-grams – from a stream of sentences.

Here we use an example dataset. The idea is that it is saved to a file on disk, thereby allowing gensim to stream its content, which is much more efficient than loading everything into memory before running the model.

Let’s look at the content of the example file.

file <- datapath('testcorpus.txt') # example dataset
readLines(file) # just to show you what it looks like
#> [1] "computer human interface"                 
#> [2] "computer response survey system time user"
#> [3] "interface system user eps"                
#> [4] "human system system eps"                  
#> [5] "response time user"                       
#> [6] "trees"                                    
#> [7] "trees graph"                              
#> [8] "trees graph minors"                       
#> [9] "survey graph minors"

We observe that it is very similar to the docs object we obtained from prepare_documents(corpus) earlier in this document. We can now scan the file to build a corpus with text8corpus.

sentences <- text8corpus(file) # stream sentences from the file on disk
phrases <- phrases(sentences, min_count = 1L, threshold = 1L)

It's that simple. Now we can apply the model to new sentences.

sentence <- list('trees', 'graph', 'minors')
wrap(phrases, sentence)
#> ['trees', 'graph_minors']
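
The same transformation can be applied lazily to the whole streamed corpus, mirroring gensim's phrases[sentences] bracket syntax. A sketch, assuming wrap accepts an iterable and that reticulate::iterate() can walk the resulting generator:

# sketch: run every streamed sentence through the phrase model
transformed <- wrap(phrases, sentences)
reticulate::iterate(transformed, print)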

We can also add vocabulary to an already trained model.

phrases$add_vocab(list(list("hello", "world"), list("meow")))
#> None

We can create a faster model with phraser.

bigram <- phraser(phrases)
wrap(bigram, sentence)
#> ['trees', 'graph_minors']
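
The speed-up comes from gensim's Phraser discarding the model state that phrases keeps around to support further add_vocab() updates: the exported object only scores sentences, which makes it smaller and faster but no longer updatable.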