Add Sparse Matrix Support For HistGradientBoostingClassifier #15336

Open
jmwoloso opened this issue Oct 22, 2019 · 6 comments · May be fixed by #19187

Comments

@jmwoloso
Contributor

Description

Hi!

I'm receiving the error below when attempting to pass a sparse matrix to HistGradientBoostingClassifier. The matrix is the result of using CountVectorizer and TfidfTransformer on input text.

In my case, the size of the text prohibits converting the sparse matrix to a dense one (I run out of memory).

Steps/Code to Reproduce

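import pandas as pd

# Required on scikit-learn < 1.0 (such as the 0.21.3 reported below),
# where this estimator was still experimental.
from sklearn.experimental import enable_hist_gradient_boosting  # noqa
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.ensemble import HistGradientBoostingClassifier

df = pd.read_csv(...)

vectorizer = CountVectorizer()
tfidf = TfidfTransformer()
clf = HistGradientBoostingClassifier()

# Both transformers return scipy sparse matrices.
vecs = vectorizer.fit_transform(df.loc[:, "very_large_text"])
vecs = tfidf.fit_transform(vecs)

# Raises the TypeError shown under "Actual Results".
clf.fit(vecs, df.loc[:, "label"])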

Expected Results

No error is thrown.

Actual Results

TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array.

Versions

System:
python: 3.7.3 (default, Oct 1 2019, 18:28:53) [GCC 5.4.0 20160609]
executable: /local_disk0/pythonVirtualEnvDirs/virtualEnv-3631eab5-084b-4139-952e-5aff594ac1bb/bin/python
machine: Linux-4.15.0-1050-azure-x86_64-with-debian-stretch-sid

Python deps:
pip: 19.0.3
setuptools: 40.8.0
sklearn: 0.21.3
numpy: 1.16.2
scipy: 1.2.1
Cython: 0.29.6
pandas: 0.24.2

@thomasjpfan
Member

Thank you for posting this feature request. We can discuss what kind of semantics we want for sparse matrix support. That is, we can treat a zero either as a missing value or as a literal zero. LightGBM uses a parameter to decide which semantics to use.

@jmwoloso
Contributor Author

No problem. Without knowing the full extent of what is required, I'd be happy to try and tackle it with your guidance on where to look, etc.

@TomDLT added the Enhancement and Moderate (anything that requires some knowledge of conventions and best practices) labels on Oct 22, 2019
@jnothman
Member

jnothman commented Oct 22, 2019 via email

@NicolasHug
Member

For reference, I had noted some implementation suggestions in #16885

I believe @StealthyKamereon wants to give it a shot.

Regarding the semantics of zeros: we can have a boolean parameter zero_as_missing, as in LightGBM. For a first version this is not necessary, though, and we should treat zeros as literal zeros to keep the PR as small as possible.
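
To make the two semantics concrete, here is a minimal sketch that is not taken from the thread. `densify` is a hypothetical helper, and its `zero_as_missing` flag only borrows the name discussed above; it is not an existing scikit-learn parameter. The sketch expands a tiny sparse matrix both ways purely for illustration, which of course would not scale to the reporter's data.

```python
import numpy as np
from scipy import sparse


def densify(X, zero_as_missing=False):
    """Hypothetical helper: expand a sparse matrix under one of the two semantics."""
    X = sparse.csr_matrix(X)
    if not zero_as_missing:
        # Literal zeros: entries that are not stored become 0.0.
        return X.toarray()
    # Zeros as missing: entries that are not stored become NaN,
    # while stored entries keep their value.
    dense = np.full(X.shape, np.nan)
    coo = X.tocoo()
    dense[coo.row, coo.col] = coo.data
    return dense


X = sparse.csr_matrix(np.array([[0.0, 2.0], [3.0, 0.0]]))
as_zero = densify(X)                           # unstored entries read as 0.0
as_missing = densify(X, zero_as_missing=True)  # unstored entries read as NaN
```

Under the first reading a tree can split on "value is 0", while under the second the unstored entries would be routed like any other missing value.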

@NicolasHug added the Hard (hard level of difficulty) label and removed the Moderate (anything that requires some knowledge of conventions and best practices) label on Dec 23, 2020
@StealthyKamereon

StealthyKamereon commented Dec 23, 2020

Following what you said regarding the semantics of zeros, I think that in addition to the zero_as_missing parameter there should be a categorical_missing_values parameter, which would set the missing values for categorical features.
Or maybe something like zero_as: str or list of ndarray of shape (n_cats,), default="missing"

@Apoorvgarg-creator

Can anyone suggest a temporary workaround for this problem?
