Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH Adds Categorical Support to Histogram Gradient Boosting #16909

Closed
wants to merge 89 commits into from

Conversation

thomasjpfan
Copy link
Member

@thomasjpfan thomasjpfan commented Apr 13, 2020

Reference Issues/PRs

Fixes #15550

What does this implement/fix? Explain your changes.

Currently the API for enable categorical support is to set the categorical parameter to a mask or 'pandas'.

  1. The splitting logic is based on lightgbm's categorical splitting. They used a cat_smooth (default=10) parameter that will ignore categories with cardinality lower than cat_smooth. There is a max_cat_threshold (default=32) that sets the number of categories to be in the threshold to go left. For this PR these defaults are hard coded in the splitter.

  2. This implementation is able to handle unknown categories as well as treating missing values as its own category. If the cardinality of a categorical feature is greater than max_bins, then the top max_bins categories based on cardinality will be kept and the less frequent categories will be considered missing.

  3. Currently, predict only bins the categorical features and passes that to the predictors. If a node were to split on a categorical feature, it will only cat_threshold, which is a boolean mask of the categories going to the left.

  4. To make pandas support a little nicer, negative values will also be encoded as missing. This is because pandas categories will give -1 as the encoding for missing categories.

  5. check_array was updated to include a use_pd_categorical_encoding parameter that will use the encoding provided by pandas for encoding. It's implementation is structured in a way to make the least amount of copies.

  6. The example shows a comparison of using one hot encoder and native categorical support on the ames housing dataset. As expected, the fit times are much better and the scores are kept the same:

Screen Shot 2020-04-12 at 9 00 17 PM

Any other comments?

The ultimate goal of this PR was to enable the following API:

from sklearn.datasets import fetch_openml
from sklearn.experimental import enable_hist_gradient_boosting
from sklearn.ensemble import HistGradientBoostingRegressor

# ames housing with 46 categories and 34 numerical features
X, y = fetch_openml(data_id=41211, as_frame=True, return_X_y=True)
hist = HistGradientBoostingRegressor(categorical='pandas')
hist.fit(X, y)

Here is a similiar script using the adult dataset (which has missing values, 48842 samples, 12 categorical, and 2 numeric features) and how the result compares. Previously we would need impute before passing it to one one encoder. With this PR, the data can be directly passed in.

adult_bench

CC @NicolasHug @adrinjalali @ogrisel

Copy link
Member

@NicolasHug NicolasHug left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @thomasjpfan , I'm very excited about this!

Made a first pass on the binner, more to come later.

I do appreciate the comments ;)

Regarding treating negative values as missing: I think we should only do that for pandas input and when the column is categorical. I.e., only in case we can strictly rely on pandas. Otherwise, I feel like this is enforcing an unexpected behavior: a negative category is just as fine as any other category, and I would not expect it to be treated as missing in general. We clearly don't do that in the other estimators / encoders

sklearn/ensemble/_hist_gradient_boosting/binning.py Outdated Show resolved Hide resolved
sklearn/ensemble/_hist_gradient_boosting/binning.py Outdated Show resolved Hide resolved
sklearn/ensemble/_hist_gradient_boosting/binning.py Outdated Show resolved Hide resolved
Comment on lines 92 to 95
The maximum number of bins to use for non-missing values. If for a
given feature the number of unique values is less than ``max_bins``,
then those unique values will be used to compute the bin thresholds,
instead of the quantiles
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs an update it seems

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, this isn't resolved. There's no notion of quantiles in this function

Copy link
Member

@NicolasHug NicolasHug left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some more on the splitter, mostly questions or minor suggestions

sklearn/ensemble/_hist_gradient_boosting/grower.py Outdated Show resolved Hide resolved
sklearn/ensemble/_hist_gradient_boosting/grower.py Outdated Show resolved Hide resolved
elif bin_thresholds is not None:
node['threshold'] = bin_thresholds[feature_idx][bin_idx]
if split_info.is_categorical:
node['cat_threshold'] = split_info.cat_threshold
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Related to my previous comment about avoiding binning at predict time: see how we store the real-valued threshold below for numerical data. Couldn't we here directly store the non-binned categories? Potentially as a mask?

sklearn/ensemble/_hist_gradient_boosting/splitting.pyx Outdated Show resolved Hide resolved
sklearn/ensemble/_hist_gradient_boosting/splitting.pyx Outdated Show resolved Hide resolved
sklearn/ensemble/_hist_gradient_boosting/splitting.pyx Outdated Show resolved Hide resolved
sklearn/ensemble/_hist_gradient_boosting/splitting.pyx Outdated Show resolved Hide resolved
sklearn/ensemble/_hist_gradient_boosting/splitting.pyx Outdated Show resolved Hide resolved
for threshold in thresholds:
i1 = threshold // 32
i2 = threshold % 32
assert (bitset[i1] >> np.uint32(i2)) & np.uint32(1)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure you need to cast?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This breaks if we do not cast.

# 3 categories (finds threshold by going left first)
([0, 1, 2] * 11, # X_binned
[1, 10, 1] * 11, # all_gradients
[0, 2], # thresholds
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are the missing values going to the right child? It seems like the left child is the one with the most samples?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was a mistake, now that the assert tests for equality, this is properly updated and tested.

sklearn/ensemble/_hist_gradient_boosting/_predictor.pyx Outdated Show resolved Hide resolved
sklearn/ensemble/_hist_gradient_boosting/_predictor.pyx Outdated Show resolved Hide resolved

Returns
-------
y : ndarray, shape (n_samples,)
The raw predicted values.
"""
out = np.empty(X.shape[0], dtype=Y_DTYPE)
_predict_from_numeric_data(self.nodes, X, out)
_predict_from_numeric_data(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like this one shouhld be called predict_from_non_binned_data now, because of the potential confusion with numerical vs categorical, in contrast to numerical vs binned

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed to _predict_from_data for the lack of a better name. In its current state, it is taking two arrays, a non-binned numerical ndarray and a binned categorical ndarray (if it exist). This goes hand in hand with "do we want to bin during predict".

if node.missing_go_to_left:
if node.is_categorical:
cat_idx = orig_feature_to_binned_cat[node.feature_idx]
if in_bitset(categorical_data[row, cat_idx], node.cat_threshold):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does lightgbm also do that? I.e. bin during predict, and rely on a bitset of internally encoded features?

Copy link
Member Author

@thomasjpfan thomasjpfan May 1, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lightgbm does not bin during predict. It has a dynamically sized bitset that encodes the input categorical features, so it can accept a category with any cardinality.

Currently, the implementation accepts a category with any cardinality. If the cardinality is higher than max_bins, then only the top max_bins categories are kept, ranked by cardinality, and the rest are considered missing. In this way, it is also handling infrequent categories as well. This option is more flexible, but means I have to bin predict which is disappointing.

A simpler alternative would be to restrict the input to be "ints" with range ~ [0, max_bin] and anything outside of that range will be considered missing. This would not do anything special to handle infrequent categories, but it will simplify some of the code.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thoughts @ogrisel ?

Personally, for a first version, I would prefer keeping things as simple as possible. As such, rxpecting ints in [0, max_bins] sounds reasonable

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can go either way on this. I spoke to @amueller about this and seems to prefer the current approach of "binning categories during predict".

@NicolasHug
Copy link
Member

NicolasHug commented Apr 13, 2020

I originally thought that the ordering trick would only really work for the LS loss. But now that I have read the Fisher paper more in details, it seems to me that we can always apply this trick because the trees are always least-squqres regression trees (regardless of the loss). Do you agree? Does lightGBM also uses this regardless of the loss?

doc/modules/ensemble.rst Outdated Show resolved Hide resolved
doc/modules/ensemble.rst Outdated Show resolved Hide resolved
doc/modules/ensemble.rst Outdated Show resolved Hide resolved
doc/modules/ensemble.rst Outdated Show resolved Hide resolved
doc/modules/ensemble.rst Outdated Show resolved Hide resolved
doc/modules/ensemble.rst Outdated Show resolved Hide resolved
doc/modules/ensemble.rst Outdated Show resolved Hide resolved
qsort(cat_sort_infos, used_bin, sizeof(categorical_info),
compare_cat_infos)

max_num_cat = min(MAX_CAT_THRESHOLD, (used_bin + 1) / 2)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, two questions:

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok so I guess for the first point, we're still considering the rest of the splits because we consider both directions, each ending at the middle.

But then the question becomes: why do we need to do that? It seems that the reason is to limit the loop below for i in range(best_sort_thres + 1), but I can't say I'm really convinced?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assuming we have MAX_CAT_THRESHOLD=32 and it does not reach the center, going in both directions allows us to consider 32 categories from the left, and then 32 categories from the right.

If we get rid of MAX_CAT_THRESHOLD, then I do not see a point to going in both directions.

@NicolasHug
Copy link
Member

@lorentzenchr @johannfaouzi should be intersted ;)

Comment on lines 462 to 463
# See algo 3 from the XGBoost paper
# https://arxiv.org/abs/1603.02754
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this whole comment block should be put below now. Maybe with an additional comment that for categorical features, missing values don't need 2 scans because they are treated as a native category

Copy link
Contributor

@johannfaouzi johannfaouzi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't review the implementation, I just let a few remarks on the example and on typos.


.. currentmodule:: sklearn

This example, we will compare the performance of
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this example?

Comment on lines 103 to 104
error_msg = ("categorical must be an array-like of bool with shape "
"(n_features,) or 'pandas'")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since it is also possible for ColumnTransformer, I agree that it would nice to also accept it here.

Comment on lines +12 to +13
We will work with the Ames Lowa Housing dataset which consists of numerical
and categorical features, where the houses' sales prices is the target.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are the categorical features useful for this classification task? It may be worth it to add another example where the categorical features are dropped, training should be faster but predictive performance should be worse. Dropping categorical features is another way to deal with them (in a dummy way).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this dataset, the categories do not matter as much. So I will be on the lookout for a nicer dataset.

categorical : array-like of bool of shape (n_features) or 'pandas', \
default=None.
Indicates the categorical features.
- None : no features will be consider categorical.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

considered

@thomasjpfan
Copy link
Member Author

BTW I will come back to this after the feature freeze for the next release, which will be happening pretty soon.

@amueller
Copy link
Member

amueller commented May 6, 2020

lint failure ;)

@NicolasHug
Copy link
Member

@thomasjpfan @ogrisel in bbae955 I pushed an equivalence test w.r.t. OHE in different scenarios, let me know what you think?

@thomasjpfan
Copy link
Member Author

Checking in: While debugging the differences between lightgbm and this PR, I found that for smallish trees ~ max_depth=3, that the splits are the same. The difference in performances comes from differences between the nodes' gain and value when compared to lightgbm. I'm still looking into it.

@NicolasHug
Copy link
Member

@thomasjpfan I made this comparison tool back then which may help you compare trees built with lightgbm: https://github.com/NicolasHug/plot_hgbdt

@ogrisel
Copy link
Member

ogrisel commented Sep 4, 2020

@thomasjpfan @ogrisel in bbae955 I pushed an equivalence test w.r.t. OHE in different scenarios, let me know what you think?

I think it's good test and it has value because it is simple enough, hence easier to debug if it starts failing because of a regression introduced in a future change.

However I wanted to also test that the decision function learned is the same for training set with 2 interacting categorical variables. In particular I wanted to to check that the learned decision function is also the same on datapoints with combinations of categorical variables not observed in the training set. WDYT?

Maybe it would interesting to write a similar test with many categorical variables with a large enough cardinality and a number of training samples that is significantly smaller that the number of possible combinations of the input variables. The target training gradients should not be optimally predictable from a single input variable, maybe random gradients would be enough to highlight the property I am interested in.

@NicolasHug
Copy link
Member

Hey guys,

Considering:

  • the difficulties we have moving this forward
  • the approaching next release
  • the seemingly high complexity of the code
  • @ogrisel's comment about keeping the nodes struct a structured array ENH Adds Categorical Support to Histogram Gradient Boosting #16909 (comment)
  • the fact that these estimators are still experimental
  • the fact that releasing an incomplete but useful solution is still better than never releasing a perfect solution

I would suggest to simplify this PR further, and expect categories to be encoded in [0, max_bins]. 256 categories will be enough for a lot of use-cases already. We can deal with higher cardinalities later, and at least at that point we would have a baseline against which we could compare performance. Also chances are, these future changes would all be backward-compatible.

This will allow to greatly simplify the code, and to rely on pure Cython + numpy instead of C++. I feel like this is our best chance to get this in for 0.24

I would be happy to make those changes, if that's OK with everyone.

@thomasjpfan
Copy link
Member Author

I would suggest to simplify this PR further, and expect categories to be encoded in [0, max_bins]. 256 categories will be enough for a lot of use-cases already. We can deal with higher cardinalities later, and at least at that point we would have a baseline against which we could compare performance. Also chances are, these future changes would all be backward-compatible.

By restricting to 256, I am assuming we want to still handle unknown categories?

The native implementations of 256 categories that will increase the memory footprint of the predictor nodes. In this native implementation, the node_struct adds a static array to hold on to the categorical_bitset I would prefer node_struct to have a pointer to a memory space with the categorical_bitset. The nodes that use a numerical threshold would not need to allocate space for the categorical bitset. This means another object would hold on the memory for the categorical bitsets. Do you see

@NicolasHug
Copy link
Member

By restricting to 256, I am assuming we want to still handle unknown categories?

If you mean categories unknown during fit, yes we can treat as missing as currently done here

Re memory footprint: I agree that allocating a 256 bits array isn't absolutely ideal, but it's not so bad either: this would only add 32B to a struct that is already 47B. For an estimator that has 100 trees with 256 leaves (i.e. 512 nodes assuming they're balanced), this accounts for an extra 1.6MB (for a total of 2.4 + 1.6 = 4MB). This is order of magnitudes lower than what would be needed for fitting. So for a first version, this is quite reasonable IMHO.

I guess I'd be fine with having a pointer to something, as long as we can make it dead simple. It's unlikely that we can satisfy #16909 (comment) though.

@lorentzenchr
Copy link
Member

As the alternative #18394 was merged but as several limitations, what is the future of this PR? Continue or close and still serve as a reference?

@thomasjpfan
Copy link
Member Author

Let's close for now. (We can still reference it if we need to)

@h-vetinari
Copy link

Is there anywhere the additional features of this vs. #18394 are being tracked?

@NicolasHug
Copy link
Member

NicolasHug commented Dec 23, 2020

So the difference between this PR and #18394 which was merged instead is that #18394 forces categories to be continuously encoded between [0, 254]. This PR (or more likely a variant thereof) would not restrict an upper bound on the category value. There's not issue tracking this ATM, as far as I know. Do you have use-cases with more than 254 unique category values @h-vetinari ?

Informally and from discussion we've had between core-devs, the focus seems to be more into: a) easier way to specify the categorical features #18894 and b) sparse matrix support #15336.

@lorentzenchr
Copy link
Member

I have use cases with more than 254 categorical levels and would be very glad to see this extension. However, handling the passage of categoricals into the HGBT, e.g. in a pipeline, has priority in my opinion.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support categorical features in HistGradientBoosting estimators
7 participants