Support sparse matrices in HistGradientBoosting estimators

This is a placeholder issue for sparse matrix support in the histogram-based GBDT estimators.

I guess #15550 should be tackled first.

----


Below are my thoughts and potential plan on the matter, feel free to ignore.

Binning:

We need a utility to compute quantiles on sparse data, and we need to map a float sparse matrix to a binned sparse matrix given those quantiles. To avoid having to densify `X_binned`, the zeros in `X` should be mapped to bin 0, even if that's not their actual bin (call the actual one `actual_bin_zeros`). I guess that means all the bins in `range(0, actual_bin_zeros)` get an offset of 1, i.e. they're now mapped to `range(1, actual_bin_zeros + 1)`. Though maybe we can avoid the offset by distinguishing between explicit and implicit zeros, IDK.
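To make the bin-0 mapping concrete, here is a rough sketch for a single feature column, in plain Python. Everything named here is hypothetical (`bin_sparse_column`, the `thresholds` input, the "upper edge" binning rule); it is not the existing bin mapper. It also assumes that a stored value landing in the same bin as zero can simply share bin 0, which avoids a collision with the shifted bins.

```python
from bisect import bisect_left


def bin_sparse_column(values, thresholds):
    """Bin the stored entries of one sparse column, reserving bin 0 for zeros.

    `thresholds` are the sorted upper bin edges of this feature (assumed
    input): a value x goes to the first bin whose edge is >= x.
    Returns (binned_values, actual_bin_zeros).
    """
    # Bin that a zero would get under ordinary dense binning.
    actual_bin_zeros = bisect_left(thresholds, 0.0)
    binned = []
    for x in values:
        b = bisect_left(thresholds, x)
        if b < actual_bin_zeros:
            b += 1  # bins below the zero bin are shifted up by one
        elif b == actual_bin_zeros:
            b = 0   # same bin as zero: share the reserved bin 0
        binned.append(b)
    return binned, actual_bin_zeros


# Stored (explicit) entries of one column and made-up bin edges.
binned, actual_bin_zeros = bin_sparse_column(
    [-3.0, -0.5, 1.0, 5.0], [-1.0, 0.0, 2.0]
)
print(binned, actual_bin_zeros)
```

Implicit zeros never need to be touched: they stay implicit in `X_binned` and are read as bin 0.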

Histograms:

We need a histogram builder that can handle sparse data *and* that is aware of `actual_bin_zeros` in some way. We can't just build the histograms as usual, because the zeros would then be treated as the lowest value by the splitter. In the histogram, the zeros should be placed in their proper bin, i.e. at index `actual_bin_zeros`. This way, the splitter can be left unchanged. The offset of the bins in `range(1, actual_bin_zeros + 1)` should also be canceled here.

When building a histogram, we can focus only on the non-zero entries. We already know the totals `sum_gradients`, `sum_hessians`, and `count` at any given node. So we can just go through the samples that have non-zero values, fill in the histogram at their respective bins, and then set `hist[actual_bin_zeros]['grad'] = total_sum_gradients - hist[:]['grad'].sum()` (and presumably the same for the hessians and the count).
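A minimal sketch of that subtraction trick for one feature, gradients only, in plain Python. The function name and arguments are made up for illustration, and it assumes that every stored entry passed in belongs to the current node and that stored bins use the "zeros are bin 0, lower bins shifted by 1" convention described under Binning.

```python
def build_sparse_histogram(binned, sample_idx, gradients, n_bins,
                           actual_bin_zeros, total_sum_gradients):
    """Gradient histogram of one feature, visiting stored entries only.

    binned:     shifted bin of each stored entry (0 means "zero bin")
    sample_idx: row index of each stored entry
    gradients:  per-sample gradients at the current node
    """
    hist = [0.0] * n_bins
    for b, i in zip(binned, sample_idx):
        if b == 0:
            continue  # accounted for by the subtraction below
        # Cancel the +1 offset of bins in range(1, actual_bin_zeros + 1).
        real_bin = b - 1 if b <= actual_bin_zeros else b
        hist[real_bin] += gradients[i]
    # Whatever is not accounted for belongs to the zero bin.
    hist[actual_bin_zeros] = total_sum_gradients - sum(hist)
    return hist


# Six samples; rows 0, 2, 3 and 5 have stored entries, rows 1 and 4 are zeros.
gradients = [0.5, -1.0, 2.0, 0.25, 1.5, -0.75]
hist = build_sparse_histogram(
    binned=[1, 0, 2, 3],
    sample_idx=[0, 2, 3, 5],
    gradients=gradients,
    n_bins=4,
    actual_bin_zeros=1,
    total_sum_gradients=sum(gradients),
)
print(hist)
```

The resulting histogram is indexed by the real bins, so a splitter that scans bins left to right would not need to know about the offset.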


Support sparse matrices in HistGradientBoosting estimators #16885
