ENH Adds Categorical Support to Histogram Gradient Boosting #16909

thomasjpfan · 2020-04-13T01:20:24Z

Reference Issues/PRs

What does this implement/fix? Explain your changes.

Currently, the API for enabling categorical support is to set the categorical parameter to a mask or 'pandas'.

The splitting logic is based on LightGBM's categorical splitting. LightGBM uses a cat_smooth parameter (default=10) to ignore categories with cardinality lower than cat_smooth, and a max_cat_threshold parameter (default=32) that limits the number of categories in the threshold that goes left. For this PR, these defaults are hard-coded in the splitter.
This implementation handles unknown categories and treats missing values as their own category. If the cardinality of a categorical feature is greater than max_bins, the max_bins most frequent categories are kept and the less frequent categories are treated as missing.
Currently, predict only bins the categorical features and passes them to the predictors. If a node splits on a categorical feature, it only uses cat_threshold, which is a boolean mask of the categories going to the left.
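A minimal sketch of this kind of categorical split search, assuming per-category sample counts and gradient/hessian sums are already available. Only cat_smooth and max_cat_threshold come from the description above. The function name, the smoothed sort key, the single left-to-right scan, and the gain formula are illustrative assumptions, not the code in this PR's splitter or in LightGBM.

```python
def best_categorical_split(counts, sum_gradients, sum_hessians,
                           cat_smooth=10, max_cat_threshold=32):
    """Return (gain, left_categories) for one categorical feature.

    All three inputs are lists indexed by category id (hypothetical layout).
    """
    total_g = sum(sum_gradients)
    total_h = sum(sum_hessians)

    # Ignore categories that have fewer than cat_smooth samples.
    kept = [cat for cat, n in enumerate(counts) if n >= cat_smooth]
    # Order the remaining categories by a smoothed gradient/hessian ratio
    # (assumed sort key), so that a prefix of this order is a candidate split.
    kept.sort(key=lambda c: sum_gradients[c] / (sum_hessians[c] + cat_smooth))

    best_gain, best_left = 0.0, []
    left_g = left_h = 0.0
    # At most max_cat_threshold categories may be sent to the left child.
    for i, cat in enumerate(kept[:max_cat_threshold]):
        left_g += sum_gradients[cat]
        left_h += sum_hessians[cat]
        right_g, right_h = total_g - left_g, total_h - left_h
        if left_h <= 0 or right_h <= 0:
            continue
        gain = (left_g ** 2 / left_h + right_g ** 2 / right_h
                - total_g ** 2 / total_h)
        if gain > best_gain:
            best_gain, best_left = gain, kept[:i + 1]
    return best_gain, best_left


# Toy statistics for 4 categories; category 2 has only 3 samples.
counts = [50, 40, 3, 60]
sum_gradients = [-25.0, 20.0, 9.0, -30.0]
sum_hessians = [50.0, 40.0, 3.0, 60.0]
gain, left_categories = best_categorical_split(counts, sum_gradients, sum_hessians)
# left_categories == [3, 0]; category 2 is never a candidate for the left side
```

In this sketch, ignored categories always end up on the right because they are never added to the left-hand prefix.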
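As a rough illustration of how a boolean mask over category bins can route samples at a node (the helper name and the toy data are hypothetical, not taken from the predictor code):

```python
def goes_left(binned_value, cat_threshold):
    """Return True if a sample in this category bin goes to the left child."""
    return bool(cat_threshold[binned_value])


# Hypothetical node with 5 category bins; bins 0 and 3 go left.
cat_threshold = [True, False, False, True, False]
binned_column = [0, 1, 3, 4, 3]  # already-binned categorical feature values
directions = [goes_left(b, cat_threshold) for b in binned_column]
# directions == [True, False, True, False, True]
```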
To make pandas support a little nicer, negative values will also be encoded as missing. This is because pandas categories will give -1 as the encoding for missing categories.
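The binning rules described above (frequent categories kept, rare and unknown categories sent to the missing bin, negative codes treated as missing) can be sketched as follows. This is a pure-Python illustration under stated assumptions: the function names are hypothetical, None stands in for a missing value, and placing the missing bin at index max_bins is an assumption rather than something this description states.

```python
from collections import Counter


def fit_category_bins(values, max_bins):
    """Map the max_bins most frequent non-negative categories to bin indices."""
    counts = Counter(v for v in values if v is not None and v >= 0)
    kept = [cat for cat, _ in counts.most_common(max_bins)]
    return {cat: bin_idx for bin_idx, cat in enumerate(sorted(kept))}


def transform_category(value, cat_to_bin, missing_bin):
    """Return the bin of a category; anything else lands in missing_bin."""
    if value is None or value < 0:
        # Missing values and negative codes (e.g. the -1 that pandas uses).
        return missing_bin
    # Rare categories dropped at fit time and unknown categories seen only
    # at predict time are both absent from the mapping.
    return cat_to_bin.get(value, missing_bin)


max_bins = 2
train = [0, 0, 0, 1, 1, 2, 2, 2, 2, 3, -1, None]
cat_to_bin = fit_category_bins(train, max_bins)  # {0: 0, 2: 1}
missing_bin = max_bins  # assumed position of the missing-values bin
binned = [transform_category(v, cat_to_bin, missing_bin)
          for v in [0, 2, 1, 3, -1, None, 7]]
# binned == [0, 1, 2, 2, 2, 2, 2]
```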
check_array was updated to include a use_pd_categorical_encoding parameter that will use the encoding provided by pandas. Its implementation is structured to make as few copies as possible.
The example shows a comparison of using a one-hot encoder and native categorical support on the Ames housing dataset. As expected, the fit times are much better and the scores are kept the same.

Any other comments?

The ultimate goal of this PR was to enable the following API:

from sklearn.datasets import fetch_openml
from sklearn.experimental import enable_hist_gradient_boosting
from sklearn.ensemble import HistGradientBoostingRegressor

# Ames housing with 46 categorical and 34 numerical features
X, y = fetch_openml(data_id=41211, as_frame=True, return_X_y=True)
hist = HistGradientBoostingRegressor(categorical='pandas')
hist.fit(X, y)

Here is a similar script using the adult dataset (which has missing values, 48842 samples, 12 categorical, and 2 numeric features) and how the result compares. Previously, we would need to impute before passing the data to a one-hot encoder. With this PR, the data can be passed in directly.

CC @NicolasHug @adrinjalali @ogrisel

NicolasHug

Thanks @thomasjpfan , I'm very excited about this!

Made a first pass on the binner, more to come later.

I do appreciate the comments ;)

Regarding treating negative values as missing: I think we should only do that for pandas input and when the column is categorical. I.e., only in cases where we can strictly rely on pandas. Otherwise, I feel like this is enforcing an unexpected behavior: a negative category is just as fine as any other category, and I would not expect it to be treated as missing in general. We clearly don't do that in the other estimators / encoders.

sklearn/ensemble/_hist_gradient_boosting/binning.py

sklearn/ensemble/_hist_gradient_boosting/gradient_boosting.py

sklearn/ensemble/_hist_gradient_boosting/binning.py

NicolasHug · 2020-04-13T12:20:01Z

sklearn/ensemble/_hist_gradient_boosting/binning.py

+        The maximum number of bins to use for non-missing values. If for a
+        given feature the number of unique values is less than ``max_bins``,
+        then those unique values will be used to compute the bin thresholds,
+        instead of the quantiles


This needs an update it seems

Sorry, this isn't resolved. There's no notion of quantiles in this function

sklearn/ensemble/_hist_gradient_boosting/tests/test_binning.py

NicolasHug

Some more on the splitter, mostly questions or minor suggestions

sklearn/ensemble/_hist_gradient_boosting/grower.py

NicolasHug · 2020-04-13T14:19:17Z

sklearn/ensemble/_hist_gradient_boosting/grower.py

-        elif bin_thresholds is not None:
-            node['threshold'] = bin_thresholds[feature_idx][bin_idx]
+        if split_info.is_categorical:
+            node['cat_threshold'] = split_info.cat_threshold


Related to my previous comment about avoiding binning at predict time: see how we store the real-valued threshold below for numerical data. Couldn't we here directly store the non-binned categories? Potentially as a mask?
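One possible reading of this suggestion, as a sketch only: translate the mask over bins into the set of raw category values once, when the node is finalized, so that prediction can test raw values directly. The helper name, the bin_to_category list, and sending unknown categories to the right are all assumptions made for illustration.

```python
def raw_left_categories(cat_threshold, bin_to_category):
    """Translate a boolean mask over bins into the raw categories going left."""
    return {bin_to_category[b] for b, left in enumerate(cat_threshold) if left}


bin_to_category = [3, 7, 12, 40]            # hypothetical raw category per bin
cat_threshold = [True, False, True, False]  # mask over bins, as in the diff above
left_categories = raw_left_categories(cat_threshold, bin_to_category)

# At predict time the raw value is tested directly, with no binning step.
# In this sketch an unknown category (99) simply goes right.
go_left = [value in left_categories for value in [12, 7, 99]]
# go_left == [True, False, False]
```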

sklearn/ensemble/_hist_gradient_boosting/tests/test_gradient_boosting.py

sklearn/ensemble/_hist_gradient_boosting/splitting.pyx