Description
This is a placeholder issue for sparse matrix support in the histogram-based GBDT estimators.
I guess #15550 should be tackled first.
Below are my thoughts and potential plan on the matter, feel free to ignore.
Binning:
We need a utility to compute quantiles on sparse data, and we need to map a float sparse matrix to a binned sparse matrix given those quantiles. To avoid having to densify `X_binned`, the zeros in `X` should be mapped to bin 0, even if that's not their actual bin (called `actual_bin_zeros`). I guess that means all the bins in `range(0, actual_bin_zeros)` have an offset of 1, i.e. now they're actually mapped to `range(1, actual_bin_zeros + 1)`. Though maybe we can avoid the offset by distinguishing between explicit and implicit zeros, IDK.
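To make the bin remapping concrete, here is a minimal pure-Python sketch for a single column. It is only an illustration under stated assumptions, not the estimator's actual code: the function name `bin_sparse_column` is hypothetical, the thresholds are assumed to be sorted upper bounds of the bins, and a value's actual bin is assumed to be the index of the first threshold that is greater than or equal to it. One consequence of the remapping is that a stored non-zero value falling in the same bin as zero ends up as an explicit 0 in the binned column.

```python
from bisect import bisect_left


def bin_sparse_column(nonzero_values, thresholds):
    """Bin the stored (non-zero) values of one sparse column.

    Hypothetical helper: `thresholds` are assumed to be the sorted upper
    bounds of the bins, so the actual bin of a value v is
    bisect_left(thresholds, v).

    Returns (stored_bins, actual_bin_zeros).
    """
    # Bin that the value 0.0 would normally fall into.
    actual_bin_zeros = bisect_left(thresholds, 0.0)

    stored_bins = []
    for v in nonzero_values:
        actual = bisect_left(thresholds, v)
        if actual < actual_bin_zeros:
            # Shift by one so that bin 0 stays reserved for the zeros.
            stored_bins.append(actual + 1)
        elif actual == actual_bin_zeros:
            # Same bin as zero: stored as an explicit 0.
            stored_bins.append(0)
        else:
            # Bins above the zero bin keep their index.
            stored_bins.append(actual)
    return stored_bins, actual_bin_zeros


# Stored values of one column; implicit zeros are simply absent.
stored_bins, actual_bin_zeros = bin_sparse_column(
    [-3.0, 0.2, 1.0, 5.0], thresholds=[-1.0, 0.5, 2.0]
)
```

Since implicit zeros never appear in the output, the binned column has exactly the same sparsity pattern as the input column.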
Histograms:
We need a histogram builder that can handle sparse data and that is aware of `actual_bin_zeros` in some way. We can't just build the histograms as usual, because then the splitter would treat the zeros as the lowest value. In the histogram, the zeros should be placed in their proper bin, i.e. at index `actual_bin_zeros`. This way, the splitter can be left unchanged. The offset of the bins in `range(1, actual_bin_zeros + 1)` should also be canceled here.
When building a histogram, we can focus only on the non-zero entries. We already know the totals `sum_gradients`, `sum_hessians`, and `count` at any given node. So we can just go through the samples that have non-zero values, fill in the histogram at their respective bins, and then set `hist[actual_bin_zeros]['grad'] = total_sum_gradients - hist[:]['grad'].sum()` (and likewise for the hessians and the counts).
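The histogram idea can be sketched in pure Python as follows. This is an illustration under assumptions rather than a proposed implementation: `build_sparse_histogram` is a hypothetical name, the histogram is a plain list of dicts, the node is assumed to contain every sample, and the stored bins are assumed to follow the convention discussed under Binning (0 for the zero bin, bins below the zero bin shifted up by one). Because explicit zeros may already have contributed to the zero bin, the remainder is added to that bin instead of being assigned.

```python
def build_sparse_histogram(rows, stored_bins, gradients, hessians, n_bins,
                           actual_bin_zeros, sum_gradients, sum_hessians,
                           count):
    """Build one feature's histogram by visiting only the stored entries.

    Hypothetical helper. `rows` and `stored_bins` describe the stored
    entries of the binned sparse column; the three totals are the
    node-level sums that are assumed to be known already.
    """
    hist = [{"grad": 0.0, "hess": 0.0, "count": 0} for _ in range(n_bins)]

    for row, stored in zip(rows, stored_bins):
        # Cancel the offset so each entry lands in its actual bin.
        if stored == 0:
            actual = actual_bin_zeros
        elif stored <= actual_bin_zeros:
            actual = stored - 1
        else:
            actual = stored
        hist[actual]["grad"] += gradients[row]
        hist[actual]["hess"] += hessians[row]
        hist[actual]["count"] += 1

    # Whatever the stored entries do not account for belongs to the
    # implicit zeros, which all sit in bin `actual_bin_zeros`.
    rest_grad = sum_gradients - sum(b["grad"] for b in hist)
    rest_hess = sum_hessians - sum(b["hess"] for b in hist)
    rest_count = count - sum(b["count"] for b in hist)
    hist[actual_bin_zeros]["grad"] += rest_grad
    hist[actual_bin_zeros]["hess"] += rest_hess
    hist[actual_bin_zeros]["count"] += rest_count
    return hist


# Five samples at the node; only rows 0, 2 and 3 have stored entries.
gradients = [1.0, 2.0, 3.0, 4.0, 5.0]
hessians = [1.0, 1.0, 1.0, 1.0, 1.0]
hist = build_sparse_histogram(
    rows=[0, 2, 3], stored_bins=[1, 0, 3],
    gradients=gradients, hessians=hessians,
    n_bins=4, actual_bin_zeros=1,
    sum_gradients=sum(gradients), sum_hessians=sum(hessians),
    count=len(gradients),
)
```

The loop only touches the stored entries, so the cost per feature scales with the number of non-zeros at the node and not with the number of samples.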