ENH Adds Categorical Support to Histogram Gradient Boosting #16909
Conversation
Thanks @thomasjpfan , I'm very excited about this!
Made a first pass on the binner, more to come later.
I do appreciate the comments ;)
Regarding treating negative values as missing: I think we should only do that for pandas input and when the column is categorical, i.e., only in cases where we can strictly rely on pandas. Otherwise, I feel like this is enforcing an unexpected behavior: a negative category is just as fine as any other category, and I would not expect it to be treated as missing in general. We clearly don't do that in the other estimators / encoders.
The maximum number of bins to use for non-missing values. If for a
given feature the number of unique values is less than ``max_bins``,
then those unique values will be used to compute the bin thresholds,
instead of the quantiles
This needs an update it seems
Sorry, this isn't resolved. There's no notion of quantiles in this function
Some more on the splitter, mostly questions or minor suggestions
elif bin_thresholds is not None:
    node['threshold'] = bin_thresholds[feature_idx][bin_idx]
if split_info.is_categorical:
    node['cat_threshold'] = split_info.cat_threshold
Related to my previous comment about avoiding binning at predict time: see how we store the real-valued threshold below for numerical data. Couldn't we here directly store the non-binned categories? Potentially as a mask?
for threshold in thresholds:
    i1 = threshold // 32
    i2 = threshold % 32
    assert (bitset[i1] >> np.uint32(i2)) & np.uint32(1)
not sure you need to cast?
This breaks if we do not cast.
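For context, here is a self-contained sketch of the bitset scheme the test above exercises. The helper names echo the PR's `in_bitset`, but this code is illustrative, not the PR's Cython implementation:

```python
import numpy as np

def set_bitset(bitset, value):
    # set bit (value % 32) of 32-bit word (value // 32);
    # the explicit np.uint32 casts keep the shift in unsigned 32-bit arithmetic
    bitset[value // 32] |= np.uint32(1) << np.uint32(value % 32)

def in_bitset(bitset, value):
    # mirrors the assertion in the test: check bit (value % 32) of word (value // 32)
    return bool((bitset[value // 32] >> np.uint32(value % 32)) & np.uint32(1))

bitset = np.zeros(8, dtype=np.uint32)  # 8 x 32-bit words -> categories 0..255
for cat in (3, 33, 255):
    set_bitset(bitset, cat)

print(in_bitset(bitset, 33), in_bitset(bitset, 34))  # -> True False
```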
# 3 categories (finds threshold by going left first)
([0, 1, 2] * 11,    # X_binned
 [1, 10, 1] * 11,   # all_gradients
 [0, 2],            # thresholds
Why are the missing values going to the right child? It seems like the left child is the one with the most samples?
This was a mistake, now that the assert tests for equality, this is properly updated and tested.
    Returns
    -------
    y : ndarray, shape (n_samples,)
        The raw predicted values.
    """
    out = np.empty(X.shape[0], dtype=Y_DTYPE)
    _predict_from_numeric_data(self.nodes, X, out)
    _predict_from_numeric_data(
I feel like this one should be called `predict_from_non_binned_data` now, because of the potential confusion with numerical vs categorical, in contrast to numerical vs binned
Renamed to `_predict_from_data` for the lack of a better name. In its current state, it is taking two arrays, a non-binned numerical ndarray and a binned categorical ndarray (if it exists). This goes hand in hand with "do we want to bin during predict".
if node.missing_go_to_left:
if node.is_categorical:
    cat_idx = orig_feature_to_binned_cat[node.feature_idx]
    if in_bitset(categorical_data[row, cat_idx], node.cat_threshold):
does lightgbm also do that? I.e. bin during predict, and rely on a bitset of internally encoded features?
lightgbm does not bin during predict. It has a dynamically sized bitset that encodes the input categorical features, so it can accept a category with any cardinality.

Currently, the implementation accepts a category with any cardinality. If the cardinality is higher than `max_bins`, then only the top `max_bins` categories are kept, ranked by cardinality, and the rest are considered missing. In this way, it also handles infrequent categories. This option is more flexible, but it means I have to bin during predict, which is disappointing.
A simpler alternative would be to restrict the input to be "ints" with range ~ [0, max_bins], where anything outside of that range will be considered missing. This would not do anything special to handle infrequent categories, but it would simplify some of the code.
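The "keep the top `max_bins` categories, treat the rest as missing" strategy described above can be sketched as follows (the function name and the `missing_code` sentinel are illustrative, not the PR's API):

```python
import numpy as np

def encode_with_infrequent_as_missing(column, max_bins, missing_code=-1):
    # Keep the max_bins most frequent categories; everything else -> missing.
    cats, counts = np.unique(column, return_counts=True)
    kept = cats[np.argsort(counts)[::-1][:max_bins]]      # most frequent first
    mapping = {c: code for code, c in enumerate(kept)}
    return np.array([mapping.get(v, missing_code) for v in column])

col = np.array([5, 5, 5, 7, 7, 9])
print(encode_with_infrequent_as_missing(col, max_bins=2).tolist())  # -> [0, 0, 0, 1, 1, -1]
```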
Thoughts @ogrisel ?

Personally, for a first version, I would prefer keeping things as simple as possible. As such, expecting ints in [0, max_bins] sounds reasonable.
I can go either way on this. I spoke to @amueller about this and he seems to prefer the current approach of "binning categories during predict".
I originally thought that the ordering trick would only really work for the LS loss. But now that I have read the Fisher paper in more detail, it seems to me that we can always apply this trick because the trees are always least-squares regression trees (regardless of the loss). Do you agree? Does lightGBM also use this regardless of the loss?
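The ordering trick under discussion can be sketched like this: sort categories by mean gradient, then only the contiguous partitions in that order need to be scanned. This is a toy version with a least-squares proxy gain; names and the gain formula are illustrative, not the PR's splitter code:

```python
import numpy as np

def best_categorical_split(cats, gradients):
    # Fisher ordering trick: sort categories by mean gradient, then only
    # contiguous splits in that order are candidates for the best partition.
    cats = np.asarray(cats)
    gradients = np.asarray(gradients, dtype=float)
    uniq = np.unique(cats)
    mean_grad = np.array([gradients[cats == c].mean() for c in uniq])
    order = uniq[np.argsort(mean_grad)]
    best_gain, best_left = -np.inf, None
    for k in range(1, len(order)):                 # n_categories - 1 partitions
        mask = np.isin(cats, order[:k])
        g_l, g_r = gradients[mask].sum(), gradients[~mask].sum()
        n_l, n_r = mask.sum(), (~mask).sum()
        gain = g_l**2 / n_l + g_r**2 / n_r         # proxy gain for an LS tree
        if gain > best_gain:
            best_gain, best_left = gain, set(order[:k].tolist())
    return best_left

# Same toy data as the 3-category test case quoted earlier in this thread:
print(best_categorical_split([0, 1, 2] * 11, [1, 10, 1] * 11))  # -> {0, 2}
```

On this data, categories 0 and 2 share a low mean gradient, so the best partition sends {0, 2} one way and {1} the other, matching the `[0, 2]` thresholds in the quoted test.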
qsort(cat_sort_infos, used_bin, sizeof(categorical_info),
      compare_cat_infos)

max_num_cat = min(MAX_CAT_THRESHOLD, (used_bin + 1) / 2)
hmm, two questions:

- why do we want to limit to `(used_bin + 1) / 2`? It seems to me that the rest of the splits, i.e. the other half, are still relevant?
- why do we want to limit to MAX_CAT_THRESHOLD? According to "optimal split for categorical features" microsoft/LightGBM#699 (comment), this is used to reduce communication cost in parallel learning, but I don't understand that
Ok so I guess for the first point, we're still considering the rest of the splits because we consider both directions, each ending at the middle.

But then the question becomes: why do we need to do that? It seems that the reason is to limit the loop below, `for i in range(best_sort_thres + 1)`, but I can't say I'm really convinced?
Assuming we have `MAX_CAT_THRESHOLD=32` and it does not reach the center, going in both directions allows us to consider 32 categories from the left, and then 32 categories from the right.

If we get rid of `MAX_CAT_THRESHOLD`, then I do not see a point to going in both directions.
@lorentzenchr @johannfaouzi should be interested ;)
# See algo 3 from the XGBoost paper
# https://arxiv.org/abs/1603.02754
this whole comment block should be put below now. Maybe with an additional comment that for categorical features, missing values don't need 2 scans because they are treated as a native category
I didn't review the implementation, I just left a few remarks on the example and on typos.
.. currentmodule:: sklearn

This example, we will compare the performance of
In this example?
error_msg = ("categorical must be an array-like of bool with shape "
             "(n_features,) or 'pandas'")
Since it is also possible for ColumnTransformer, I agree that it would be nice to also accept it here.
We will work with the Ames Iowa Housing dataset which consists of numerical
and categorical features, where the houses' sale prices are the target.
Are the categorical features useful for this task? It may be worth it to add another example where the categorical features are dropped: training should be faster but predictive performance should be worse. Dropping categorical features is another way to deal with them (in a dummy way).
For this dataset, the categories do not matter as much. So I will be on the lookout for a nicer dataset.
categorical : array-like of bool of shape (n_features) or 'pandas', \
        default=None.
    Indicates the categorical features.

    - None : no features will be consider categorical.
considered
BTW I will come back to this after the feature freeze for the next release, which will be happening pretty soon.
lint failure ;)
@thomasjpfan @ogrisel in bbae955 I pushed an equivalence test w.r.t. OHE in different scenarios, let me know what you think?
Checking in: While debugging the differences between lightgbm and this PR, I found that for smallish trees ~
@thomasjpfan I made this comparison tool back then which may help you compare trees built with lightgbm: https://github.com/NicolasHug/plot_hgbdt
I think it's a good test and it has value because it is simple enough, hence easier to debug if it starts failing because of a regression introduced in a future change. However, I also wanted to test that the decision function learned is the same for a training set with 2 interacting categorical variables. In particular, I wanted to check that the learned decision function is also the same on datapoints with combinations of categorical variables not observed in the training set. WDYT? Maybe it would be interesting to write a similar test with many categorical variables with a large enough cardinality and a number of training samples that is significantly smaller than the number of possible combinations of the input variables. The target training gradients should not be optimally predictable from a single input variable; maybe random gradients would be enough to highlight the property I am interested in.
Hey guys, Considering:
I would suggest to simplify this PR further, and expect categories to be encoded in This will allow us to greatly simplify the code, and to rely on pure Cython + numpy instead of C++. I feel like this is our best chance to get this in for 0.24. I would be happy to make those changes, if that's OK with everyone.
By restricting to 256, I am assuming we want to still handle unknown categories? The native implementation of 256 categories will increase the memory footprint of the predictor nodes. In this native implementation, the
If you mean categories unknown during fit, yes we can treat them as missing as currently done here. Re memory footprint: I agree that allocating a 256-bit array isn't absolutely ideal, but it's not so bad either: this would only add 32B to a struct that is already 47B. For an estimator that has 100 trees with 256 leaves (i.e. 512 nodes, assuming they're balanced), this accounts for an extra 1.6MB (for a total of 2.4 + 1.6 = 4MB). This is orders of magnitude lower than what would be needed for fitting. So for a first version, this is quite reasonable IMHO. I guess I'd be fine with having a pointer to something, as long as we can make it dead simple. It's unlikely that we can satisfy #16909 (comment) though.
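The memory estimate above can be checked with a quick back-of-the-envelope calculation (the struct sizes and tree shape are those quoted in the comment):

```python
# Back-of-the-envelope check of the bitset memory estimate:
# a 256-bit category mask adds 32 bytes to each predictor node.
bitset_bytes = 256 // 8            # 32 B per node for the category bitset
node_bytes = 47                    # existing node struct size quoted above
n_trees, n_leaves = 100, 256
n_nodes = 2 * n_leaves             # ~512 nodes per balanced tree
extra_mb = n_trees * n_nodes * bitset_bytes / 1e6
base_mb = n_trees * n_nodes * node_bytes / 1e6
print(f"bitsets add {extra_mb:.1f} MB on top of {base_mb:.1f} MB of nodes")
```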
As the alternative #18394 was merged but has several limitations, what is the future of this PR? Continue, or close and still serve as a reference?
Let's close for now. (We can still reference it if we need to)
Is there anywhere the additional features of this vs. #18394 are being tracked?
So the difference between this PR and #18394, which was merged instead, is that #18394 forces categories to be continuously encoded between Informally, and from discussions we've had between core devs, the focus seems to be more on: a) an easier way to specify the categorical features (#18894) and b) sparse matrix support (#15336).
I have use cases with more than 254 categorical levels and would be very glad to see this extension. However, handling the passage of categoricals into the HGBT, e.g. in a pipeline, has priority in my opinion.
Reference Issues/PRs
Fixes #15550
What does this implement/fix? Explain your changes.
Currently the API for enabling categorical support is to set the `categorical` parameter to a mask or `'pandas'`.

The splitting logic is based on lightgbm's categorical splitting. They use a `cat_smooth` (default=10) parameter that will ignore categories with cardinality lower than `cat_smooth`. There is a `max_cat_threshold` (default=32) that sets the number of categories allowed in the threshold to go left. For this PR these defaults are hard-coded in the splitter.

This implementation is able to handle unknown categories as well as treating missing values as their own category. If the cardinality of a categorical feature is greater than `max_bins`, then the top `max_bins` categories based on cardinality will be kept and the less frequent categories will be considered missing.

Currently, `predict` only bins the categorical features and passes that to the predictors. If a node were to split on a categorical feature, it will only use `cat_threshold`
, which is a boolean mask of the categories going to the left.

To make pandas support a little nicer, negative values will also be encoded as missing. This is because pandas categories will give `-1` as the encoding for missing categories. `check_array` was updated to include a `use_pd_categorical_encoding` parameter that will use the encoding provided by pandas. Its implementation is structured in a way to make the fewest copies.

The example shows a comparison of using a one hot encoder and native categorical support on the Ames housing dataset. As expected, the fit times are much better and the scores are kept the same:
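As a side note on the pandas encoding mentioned above, the `-1` code for missing categories can be seen directly on any categorical Series:

```python
import pandas as pd

# pandas represents missing entries of a categorical column with code -1,
# which is why negative codes are treated as missing in this PR.
s = pd.Series(["a", "b", None, "a"], dtype="category")
print(s.cat.codes.tolist())  # -> [0, 1, -1, 0]
```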
Any other comments?
The ultimate goal of this PR was to enable the following API:
Here is a similar script using the adult dataset (which has missing values, 48842 samples, 12 categorical, and 2 numeric features) and how the results compare. Previously we would need to impute before passing it to a one hot encoder. With this PR, the data can be directly passed in.
CC @NicolasHug @adrinjalali @ogrisel