model_tfidf.Rd
Initialise a model based on the document frequencies of all its features.
model_tfidf(mm, normalize = FALSE, smart = "nfc", pivot = NULL, slope = 0.25, ...)

# S3 method for mm_file
model_tfidf(mm, normalize = FALSE, smart = "nfc", pivot = NULL, slope = 0.25, ...)

# S3 method for mm
model_tfidf(mm, normalize = FALSE, smart = "nfc", pivot = NULL, slope = 0.25, ...)

# S3 method for python.builtin.list
model_tfidf(mm, normalize = FALSE, smart = "nfc", pivot = NULL, slope = 0.25, ...)

# S3 method for python.builtin.tuple
model_tfidf(mm, normalize = FALSE, smart = "nfc", pivot = NULL, slope = 0.25, ...)

load_tfidf(file)
mm
  A matrix market, such as the output of `doc2bow()`.

normalize
  Normalize document vectors to unit Euclidean length? You can also inject your own function into `normalize`.

smart
  SMART (System for the Mechanical Analysis and Retrieval of Text) Information Retrieval System, a mnemonic scheme for denoting tf-idf weighting variants in the vector space model. The mnemonic for representing a combination of weights takes the form XYZ, for example `"nfc"` (the default): natural term frequency, idf document frequency weighting, and cosine normalization.

pivot
  You can either set the pivot by hand, or you can let Gensim figure it out automatically with the following two steps:
  1) Set either the u or b document normalization in the `smart` parameter.
  2) Set either the corpus or dictionary parameter. The pivot will be automatically determined from the properties of the corpus or dictionary.
  If `pivot` is `NULL` and the two steps above are not followed, pivoted document length normalization is disabled.

slope
  Setting the slope to 0.0 uses only the pivot as the norm, while setting the slope to 1.0 effectively disables pivoted document length normalization. Singhal [2] suggests setting the slope between 0.2 and 0.3 for best results.

...
  Any other options, from the official documentation.

file
  Path to a saved model.
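How `pivot` and `slope` interact can be made concrete with a small sketch. The helper below is hypothetical (its name and signature are not part of any package API), but it applies the pivoted-norm formula gensim uses internally, `pivoted_norm = (1 - slope) * pivot + slope * norm`:

```python
import math

def pivoted_normalize(weights, pivot, slope):
    """Scale a document's term weights by the pivoted document length norm.

    Illustrative helper, not a library function. With slope = 1.0 this
    reduces to ordinary cosine (unit Euclidean length) normalization;
    with slope = 0.0 the pivot alone is used as the norm.
    """
    norm = math.sqrt(sum(w * w for w in weights))        # plain Euclidean length
    pivoted_norm = (1 - slope) * pivot + slope * norm    # blend pivot and true norm
    return [w / pivoted_norm for w in weights]

doc = [3.0, 4.0]  # toy weight vector with Euclidean length 5
print(pivoted_normalize(doc, pivot=5.0, slope=1.0))  # slope = 1: ordinary cosine norm
print(pivoted_normalize(doc, pivot=2.0, slope=0.0))  # slope = 0: pivot used as the norm
```

Intermediate slopes (such as the 0.2-0.3 range suggested by Singhal) shrink the penalty that plain cosine normalization places on long documents.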
Term frequency weighting:
  b - binary
  t or n - raw
  a - augmented
  l - logarithm
  d - double logarithm
  L - log average

Document frequency weighting:
  x or n - none
  f - idf
  t - zero-corrected idf
  p - probabilistic idf

Document normalization:
  x or n - none
  c - cosine
  u - pivoted unique
  b - pivoted character length
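The default scheme `"nfc"` combines one letter from each group above: natural (raw) term frequency, plain idf, and cosine normalization. A minimal pure-Python sketch of that combination, with idf taken as log base 2 (gensim's default), follows; the function name and inputs are illustrative, not part of any package API:

```python
import math

def nfc_weights(doc_bow, doc_freqs, total_docs):
    """Illustrative SMART "nfc" weighting for one document.

    doc_bow: {term: raw count in this document}        (n: natural tf)
    doc_freqs: {term: number of documents containing term}
    total_docs: size of the corpus
    """
    # f: plain idf, log2(N / df)
    weights = {t: tf * math.log2(total_docs / doc_freqs[t])
               for t, tf in doc_bow.items()}
    # c: cosine normalization to unit Euclidean length
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

# toy corpus of 4 documents; "cat" appears in 2 of them, "rare" in only 1
w = nfc_weights({"cat": 2, "rare": 1}, {"cat": 2, "rare": 1}, total_docs=4)
```

Here "cat" has twice the raw count of "rare" but half its idf, so both end up with the same normalized weight, which is the trade-off the X and Y letters of the mnemonic control.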
#> → Preprocessing 9 documents
#> ← 9 documents after preprocessing
dictionary <- corpora_dictionary(docs)
corpora <- doc2bow(dictionary, docs)

# fit model
tfidf <- model_tfidf(corpora)