Add Sparse Matrix Support For HistGradientBoostingClassifier #15336

jmwoloso · 2019-10-22T16:23:30Z

Description

Hi!

I'm receiving the error below when attempting to pass a sparse matrix to HistGradientBoostingClassifier. The matrix is the result of using CountVectorizer and TfidfTransformer on input text.

In my case, the size of the text prohibits converting the sparse matrix to a dense one (I run out of memory).

Steps/Code to Reproduce

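import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
# On scikit-learn < 1.0, first run:
# from sklearn.experimental import enable_hist_gradient_boosting  # noqa
from sklearn.ensemble import HistGradientBoostingClassifier

df = pd.read_csv(...)

vectorizer = CountVectorizer()
tfidf = TfidfTransformer()
clf = HistGradientBoostingClassifier()

# Both transformers return scipy sparse matrices
vecs = vectorizer.fit_transform(df.loc[:, "very_large_text"])
vecs = tfidf.fit_transform(vecs)

# Raises the TypeError shown below: the estimator only accepts dense input
clf.fit(vecs, df.loc[:, "label"])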

Expected Results

No error is thrown.

Actual Results

TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array.

Versions

System:
python: 3.7.3 (default, Oct 1 2019, 18:28:53) [GCC 5.4.0 20160609]
executable: /local_disk0/pythonVirtualEnvDirs/virtualEnv-3631eab5-084b-4139-952e-5aff594ac1bb/bin/python
machine: Linux-4.15.0-1050-azure-x86_64-with-debian-stretch-sid

Python deps:
pip: 19.0.3
setuptools: 40.8.0
sklearn: 0.21.3
numpy: 1.16.2
scipy: 1.2.1
Cython: 0.29.6
pandas: 0.24.2

thomasjpfan · 2019-10-22T16:33:26Z

Thank you for posting this feature request. We can discuss what kind of semantics we want for sparse matrix support, i.e. whether an entry that is not stored in the sparse matrix is treated as missing or as a literal zero. LightGBM uses a parameter to decide which semantics to use.
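To make the two readings concrete, here is a minimal sketch on a toy matrix. It is not how scikit-learn or LightGBM implement anything; the variable names are illustrative, and treating an explicitly stored 0.0 as a real value under the "missing" reading is an assumption made for this example only.

```python
import numpy as np
from scipy import sparse

# Toy 2x3 matrix: three stored entries, one of which is an explicit 0.0 at (1, 1).
rows = np.array([0, 1, 1])
cols = np.array([0, 1, 2])
vals = np.array([1.5, 0.0, 2.0])
X = sparse.csr_matrix((vals, (rows, cols)), shape=(2, 3))

# Reading A: every entry that is not stored is a literal 0.0.
dense_zero = X.toarray()

# Reading B (assumption for this sketch): entries that are not stored are
# missing (NaN), while stored entries, including the explicit 0.0, keep their value.
dense_missing = np.full(X.shape, np.nan)
coo = X.tocoo()
dense_missing[coo.row, coo.col] = coo.data

print(dense_zero)
print(dense_missing)
```

Under reading A the two rows become ordinary dense vectors; under reading B the three unstored positions become NaN, which an estimator with native missing-value support would route separately from real zeros.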

jmwoloso · 2019-10-22T16:46:55Z

No problem. Without knowing the full extent of what is required, I'd be happy to try and tackle it with your guidance on where to look, etc.

jnothman · 2019-10-22T22:40:02Z

Zero semantics would be consistent with every other estimator (except for pairwise data).

NicolasHug · 2020-12-23T08:59:15Z

For reference, I noted some implementation suggestions in #16885.

I believe @StealthyKamereon wants to give it a shot.

Regarding the semantics of zeros: we can have a boolean parameter zero_as_missing, as LightGBM does. This is not necessary for a first version, though; we should treat zeros as literal zeros to keep the PR as small as possible.

StealthyKamereon · 2020-12-23T11:21:30Z

Following what you said regarding the semantics of zeros, I think that in addition to the zero_as_missing parameter there should be a categorical_missing_values parameter, which would set the missing values for categorical features.
Or maybe something like zero_as: str or list of ndarray of shape (n_cats,), default="missing"

Apoorvgarg-creator · 2024-02-26T09:17:56Z

Can anyone suggest a temporary workaround for this problem?

TomDLT added the Enhancement and Moderate labels Oct 22, 2019

StealthyKamereon mentioned this issue Dec 22, 2020

Support sparse matrices in HistGradientBoosting estimators #16885

Closed

NicolasHug added the Hard label and removed the Moderate label Dec 23, 2020

NicolasHug mentioned this issue Dec 23, 2020

ENH Adds Categorical Support to Histogram Gradient Boosting #16909

Closed

StealthyKamereon linked a pull request Jan 16, 2021 that will close this issue

[WIP] Add sparse matrix support for histgradientboostingclassifier #19187

Open

cmarmo added the module:ensemble label Jan 19, 2021

SherryLi11913 mentioned this issue Mar 27, 2022

Fix of Issue #15336 SherryLi11913/scikit-learn#4

Merged

SherryLi11913 added a commit to SherryLi11913/scikit-learn that referenced this issue Mar 27, 2022

Merge pull request #4 from SherryLi11913/DEV-19

46aab18

Fix of Issue scikit-learn#15336

cmarmo added the help wanted label Mar 28, 2022
