Topic modeling #37
base: dev
Conversation
```
    ],
    'topic_modeling': [
        'gensim',
        'spacy'
```
Do we not need a spacy corpus as well?
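Worth noting (an aside, not from the PR itself): the English stop word list ships with the base spacy package, so if stop words are all that is used, no separately downloaded model/corpus appears to be required for that import:

```python
# Bundled with spacy itself; no `python -m spacy download ...` is needed just for this.
from spacy.lang.en.stop_words import STOP_WORDS

print(len(STOP_WORDS))  # a few hundred entries
```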
quantgov/estimator/structures.py
Outdated
```python
class QGLdaModel(BaseEstimator, TransformerMixin):
    @check_gensim
    @check_spacy
    def __init__(self, word_regex=r'\b[A-z]{2,}\b', stop_words=STOP_WORDS):
```
I would think the options for stop_words should be:

- None (default): no stop words
- True: use built-in stop words
- A sequence: user-specified stop words
Not having any stop words seems to produce a pretty unusable model. My thinking is that it's best to have some default; if the user chooses to override that default with None they can, but the defaults should be able to produce something usable. We could log some output when they don't provide any (e.g. "INFO: No stop words provided, using sklearn builtins"), and potentially a warning if None is passed.
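A minimal sketch of how that resolution could look (the function name and the 'default' sentinel are illustrative, not part of the PR):

```python
import logging
import warnings

from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS


def resolve_stop_words(stop_words='default'):
    """Resolve the stop_words options discussed above (illustrative sketch)."""
    if stop_words == 'default':
        logging.info("No stop words provided, using sklearn builtins")
        return frozenset(ENGLISH_STOP_WORDS)
    if stop_words is None:
        warnings.warn("No stop words will be used; topics may be dominated by common words")
        return frozenset()
    if stop_words is True:
        return frozenset(ENGLISH_STOP_WORDS)
    return frozenset(stop_words)  # user-specified sequence
```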
quantgov/estimator/structures.py
Outdated
```python
class QGLdaModel(BaseEstimator, TransformerMixin):
    @check_gensim
    @check_spacy
    def __init__(self, word_regex=r'\b[A-z]{2,}\b', stop_words=STOP_WORDS):
```
word_regex should be word_pattern to match what already exists in SKL.
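For reference, the analogous parameter on sklearn's own vectorizers is token_pattern; shown here only to illustrate the existing convention:

```python
from sklearn.feature_extraction.text import CountVectorizer

# CountVectorizer's default pattern keeps tokens of two or more word
# characters, roughly the same intent as the r'\b[A-z]{2,}\b' default above.
vectorizer = CountVectorizer(token_pattern=r'(?u)\b\w\w+\b')
```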
quantgov/estimator/structures.py
Outdated
```
@@ -85,3 +113,41 @@ class CandidateModel(
    parameter values to test as values
    """
    pass


class QGLdaModel(BaseEstimator, TransformerMixin):
```
I don't like either the prefix or the Model specifier. I'd call this GensimLDA or something like that.
quantgov/estimator/structures.py
Outdated
```python
import re

try:
    from spacy.lang.en.stop_words import STOP_WORDS
```
If we're literally only using spacy here for the stopwords, can't we somehow find the sklearn stopwords used in the CountVectorizer? That's got to be importable from somewhere.
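It is importable; sklearn exposes the list that CountVectorizer(stop_words='english') uses, so the swap could look like this:

```python
# The built-in English stop word list behind CountVectorizer(stop_words='english')
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

STOP_WORDS = frozenset(ENGLISH_STOP_WORDS)
```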
```python
                     for doc in driver.stream()])
        stop_ids = [self.dictionary.token2id[stopword] for stopword
                    in self.stop_words if stopword in self.dictionary.token2id]
        once_ids = [tokenid for tokenid, docfreq in
```
Why are we doing this?
Filtering out words that only occur once was recommended in the Gensim documentation - beyond that, I don't know if it actually improves the performance of the model.
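For context, this is roughly the recipe from the Gensim corpora tutorial that the diff follows (toy data and variable names, illustrative only):

```python
from gensim import corpora

stop_words = {'the', 'a', 'of'}
texts = [['human', 'machine', 'interface'],
         ['a', 'survey', 'of', 'user', 'interface', 'design']]
dictionary = corpora.Dictionary(texts)

# Remove stop words and tokens that appear in only one document,
# then compact the id mapping so the removed ids leave no gaps.
stop_ids = [dictionary.token2id[w] for w in stop_words
            if w in dictionary.token2id]
once_ids = [tokenid for tokenid, docfreq in dictionary.dfs.items()
            if docfreq == 1]
dictionary.filter_tokens(stop_ids + once_ids)
dictionary.compactify()
```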
quantgov/estimator/structures.py
Outdated
```python
                     for i in self.word_regex
                               .finditer(doc.text)]
                     for doc in driver.stream()])
        stop_ids = [self.dictionary.token2id[stopword] for stopword
```
Wouldn't it be better to only pass the dictionary words that aren't in stop_words?
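A sketch of that alternative, dropping stop words before the dictionary is built (hypothetical variable names, assuming already-tokenized documents):

```python
from gensim import corpora

stop_words = {'the', 'a', 'of'}
tokenized_docs = [['the', 'cost', 'of', 'regulation'],
                  ['a', 'survey', 'of', 'regulation']]

# Filter per document so stop words never enter the dictionary,
# rather than building the dictionary and then removing their ids.
filtered = [[w for w in doc if w not in stop_words] for doc in tokenized_docs]
dictionary = corpora.Dictionary(filtered)
```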
@OliverSherouse ready for follow-up review