
[WIP] Add sparse matrix support for HistGradientBoostingClassifier #19187

Open
wants to merge 4 commits into base: main

Conversation

StealthyKamereon

This pull request adds crude support for sparse matrices to HistGradientBoostingClassifier.
The handling of sparse matrices is deliberately crude to keep the PR small and avoid touching the Cython code for now.

Reference Issues/PRs

Fixes #15336

What does this implement/fix? Explain your changes.

It implements sparse matrix support for HistGradientBoostingClassifier.
There are two major differences when handling a sparse matrix: during fit and during transform.

Fit

At fitting time, the sparse matrix can be safely converted to a dense one because only the unique values matter.
This is done by collecting the non-zero values and then appending a single 0 to the list of values.
The matrix can then be treated the same way a dense one is.

That way, it avoids duplicating the code and having to test a new algorithm.
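
For illustration, a minimal sketch of this densification on a single column (hypothetical variable names; the PR's actual diff is quoted later in the review):

import numpy as np
from scipy import sparse

# Sketch only: collect the stored (non-zero) entries of one column and append
# a single 0 so the thresholds account for the zeros without materializing them.
col = sparse.csr_matrix([[0.0], [3.0], [0.0], [1.0]])
values = col[col.nonzero()[0]].toarray().ravel()  # array([3., 1.])
if col.nnz < col.shape[0]:                        # the column contains zeros
    values = np.append(values, 0)                 # array([3., 1., 0.])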

Transform

Transform is trickier.
The main problem here (as @NicolasHug explained) is that we want the bin with value 0 to be mapped to the value 0.
That way we can take advantage of the sparse structure to treat missing values as 0.
In the returned matrix, values in the bins [0, actual_bin_zero) are right-shifted by one and the bin actual_bin_zero is mapped to 0.
The transform function returns actual_bin_zero alongside the matrix.
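
To make the shifting concrete, here is a small sketch (hypothetical helper name, dense array for readability):

import numpy as np

def _shift_bins(binned, actual_bin_zero):
    # Sketch: bins below actual_bin_zero move up by one and actual_bin_zero
    # itself becomes 0, so implicit zeros can stay implicit in the output.
    shifted = binned.copy()
    shifted[binned < actual_bin_zero] += 1
    shifted[binned == actual_bin_zero] = 0
    return shifted

# e.g. _shift_bins(np.array([0, 1, 2, 3, 4]), actual_bin_zero=3)
# -> array([1, 2, 3, 0, 4])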

The second short-term problem is that the core of the binning logic is done in _map_to_bins, which is a Cython function.
I prefer not to touch the Cython code for now, so I worked around it.

The process is the following:

  1. We pass a 1x1 zero-filled matrix to _map_to_bins to determine actual_bin_zero.
  2. We pack the sparse matrix into a dense one (process described in detail below).
  3. We pass the packed matrix to _map_to_bins to map the bins.
  4. We shift the mapped bins.
  5. We unpack the dense matrix into a sparse one, preserving the original structure.
Packing / Unpacking

To allow _map_to_bins to "process" a sparse matrix, we pack it into a dense one.
This is done column-wise, and the resulting matrix is the smallest one that can fit the data.
The function returns the packed matrix and the row indices of the original values.

Unpacking the matrix is the exact opposite: we use the row indices to put the values back into their original positions in the sparse matrix.

Example:

>>> a = csr_matrix(array(
      [[   1, 1000,    0],
       [   0,  999,   -1],
       [   2,    0,   -2],
       [   3,    0,   -3],
       [   0,    0,   -4]]))
>>> _pack_matrix(a)
(
array([[   1, 1000,   -1],
       [   2,  999,   -2],
       [   3,    0,   -3],
       [   0,    0,   -4]]), 
array([[ 0.,  0.,  1.],
       [ 2.,  1.,  2.],
       [ 3., nan,  3.],
       [nan, nan,  4.]])
)
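
Consistent with this example, here is a rough sketch of how such a packing could be implemented (illustrative only, not the PR's exact code):

import numpy as np
import scipy.sparse as sp

def _pack_matrix(sparse_matrix):
    # Sketch: stack each column's non-zero values at the top of a dense matrix
    # whose height is the largest per-column non-zero count; row_indices keeps
    # the original row of every packed value, with NaN marking the padding.
    X = sp.csc_matrix(sparse_matrix)
    X.sort_indices()
    n_cols = X.shape[1]
    height = np.diff(X.indptr).max()
    packed = np.zeros((height, n_cols), dtype=X.dtype)
    row_indices = np.full((height, n_cols), np.nan)
    for j in range(n_cols):
        start, stop = X.indptr[j], X.indptr[j + 1]
        packed[:stop - start, j] = X.data[start:stop]
        row_indices[:stop - start, j] = X.indices[start:stop]
    return packed, row_indices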

Any other comments?

The main performance bottleneck is the packing / unpacking of the matrix.
I kept that in mind and always used the sparsity of the matrix to my advantage.

In the future, the Cython code will need to be reworked to make room for a cleaner flow.

Comment on lines 90 to 95
# Since col_data is only used to get unique values
# and compute percentiles, zeros don't matter
values = col_data[col_data.nonzero()[0]].toarray()
if 0 in col_data:
    values = np.append(values, 0)
col_data = values
Member
@NicolasHug NicolasHug Jan 17, 2021

Thanks for the PR, @StealthyKamereon.

Do I understand correctly that this essentially removes all zeros but one within the column?

If that's the case, then the behaviour won't be equivalent to the dense case:

# dense case
import numpy as np

col_data = np.random.randn(1000)
col_data[:900] = 0

max_bins = 30
percentiles = np.linspace(0, 100, num=max_bins + 1)
percentiles = percentiles[1:-1]

np.percentile(col_data, percentiles, interpolation='midpoint')
# gives
array([-0.32659394,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.5365524 ])

!=

# proposed sparse case
col_data2 = col_data[np.flatnonzero(col_data)]
col_data2 = np.append(col_data2, 0)

np.percentile(col_data2, percentiles, interpolation='midpoint')
# gives
array([-1.59708717, -1.28013648, -1.06639733, -0.93588368, -0.86844474,
       -0.77952759, -0.67911702, -0.61731544, -0.5256898 , -0.32659394,
       -0.19481156, -0.05218316,  0.00542425,  0.0535143 ,  0.1421762 ,
        0.24276456,  0.27940835,  0.36767042,  0.44821144,  0.5365524 ,
        0.64450162,  0.67886618,  0.77848088,  0.87048614,  1.08061173,
        1.25631194,  1.50380088,  1.60707389,  1.88223662])

col_data2 basically corresponds to what this PR does to the sparse case and it's very different from the above dense case.

In any case, it would be good to have a test ensuring the consistency of the output for the dense and sparse cases (with various max_bins, sparsity levels, and number of unique values)

Author

You are right, I naively thought that the output would be the same, but it's definitely not the case.
I added some tests as suggested, but concerning the computation, does that mean we just want to densify col_data?

Member

Now that I think more about this, it might actually make sense to not have an equivalence in the bin frontiers computation between the sparse and the dense case.

If a column has tons of zeros, just computing the percentiles would lead to non-informative bin frontiers as most of them would be zeros (see my updated message above with the actual values).

OTOH if we first remove all zeros we'll have better bin frontiers... but we'd be missing 2 important ones: 0 - eps and 0 + eps.

I know that LightGBM manually forces these 2 bin frontiers to exist (whether the matrix is sparse or not BTW, but that's irrelevant).

Maybe a good strategy for the sparse case would be to do almost as you did, but:

  • compute percentiles on the densified non-zero values without re-inserting a 0, and ask for n - 2 midpoints instead of n
  • manually insert 0 - eps and 0 + eps (a simple binary search would do here) so we end up with the required n midpoints.

The equivalence between the sparse and dense cases would not hold, but it actually looks more sensible this way. Thoughts @StealthyKamereon @ogrisel @lorentzenchr?
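
A rough sketch of that strategy (assumed helper name and signature, not actual scikit-learn code) could look like:

import numpy as np

def sparse_bin_thresholds(nonzero_values, max_bins, dtype=np.float64):
    # Sketch: ask for (max_bins - 1) - 2 midpoints on the non-zero values only,
    # then force the two frontiers around zero so zeros get their own bin.
    eps = np.finfo(dtype).eps
    percentiles = np.linspace(0, 100, num=max_bins - 1)[1:-1]
    midpoints = np.percentile(nonzero_values, percentiles,
                              interpolation='midpoint')
    # a simple binary search keeps the frontiers sorted after insertion
    positions = np.searchsorted(midpoints, [-eps, eps])
    return np.insert(midpoints, positions, [-eps, eps])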

Author
@StealthyKamereon StealthyKamereon Jan 17, 2021

I agree; this is not very visible in our dummy examples, but if we look at the sparsity of the matrix generated for #15336, we can imagine that all the useful information would be squeezed out.
What do you suggest for eps? Something like 1e-7, or bigger?

Member

np.finfo(X_DTYPE).eps should be what we need here

Author

Then what about transform?
If we compute the bins ignoring all the implicit zeros, it feels wrong to map those zeros to actual_bin_zero 🤔

Member

Hm indeed, transform should definitely not use the same code as for the dense case, and indeed zeros should be mapped to bin 0. In this case we might not need to insert the 2 additional bin frontiers, and just let the sparse transform properly deal with the zeros? I guess the sparse transform code would look like:

if x is an implicit 0:
	map to bin 0
else:
	bin = use same binary search as in dense transform
	map to bin + 1
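
For illustration only, a sketch of that pseudocode on one CSC column (assumed helper name and thresholds argument):

import numpy as np
import scipy.sparse as sp

def _transform_sparse_column(col_csc, binning_thresholds):
    # Sketch: implicit zeros are simply not stored, so they stay at bin 0,
    # while every stored value gets a binary-search bin shifted up by one.
    binned = col_csc.copy()
    binned.data = np.searchsorted(binning_thresholds, col_csc.data,
                                  side='right') + 1
    return binned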

Author

I don't understand why we need the two bins. 🤷
To sum up, we want col_data to ignore implicit zeros (do we want to ignore explicit ones too?) and follow the same logic as the dense case from there.
For transform we don't want to map implicit zeros to actual_bin_zero, so we get rid of the shifting process.
Maybe we want to shift every bin by one so the implicit zeros don't over-populate bin 0?

Member

I agree we don't need the 2 additional bin frontiers. We should just compute the bin frontiers on all non-zero values, and then pass these bin frontiers to a sparse transform. The sparse transform would be in charge of mapping the zeros to 0, and mapping the rest of the values into [1, upper_bound] (where upper_bound is usually 255 (or 254, I don't remember; 255 might be reserved for missing values)). This mapping for the non-zero values can use the same binary search algorithm as for the dense case.

@StealthyKamereon
Author

Okay, everything we discussed is done.
I deliberately didn't add much documentation for now, as the content of this PR can change greatly depending on your remarks.
I'll add it at the end, or once some functionality has stabilized.

Base automatically changed from master to main January 22, 2021 10:53
Member
@thomasjpfan thomasjpfan left a comment

As commented, I see the current bin mapper implementation having significant memory overhead.

If we go down the native approach, then we need the tree building to operate on a csc matrix. The csc matrix would represent the binned data. For prediction, we would need to operate on csr matrices. These csr matrices would be unbinned.

For the sparse case, I think we need Exclusive Feature Bundling as described in section 4 of the LightGBM paper which is enabled by default in LightGBM. I would be open to having this in a future PR.

I think the hard part would be to put sparse matrices into what we have now for tree building and prediction. I think prediction would be a little easier because worst case, we can create a SparseTreePredictor.

For the tree building, I think we need to create separate objects for Histogram Building and the Splitter for handling sparse data. Luckily the histograms can still be dense, so the loss computation is okay.

Going back to the binner, I think it would be cleaner to have a sparse binner as well. I think mixing in too much dense code with sparse code makes things harder to maintain.



def _unpack_matrix(dense_matrix, row_indices, shape):
    sparse_matrix = sp.lil_matrix(shape, dtype=dense_matrix.dtype)
Member

From my experience, using lil_matrix to construct medium-sized matrices and then converting with tocsr is not great time-wise or memory-wise.

If we are building a binned matrix, we know for sure that it will have the same non-zero elements as the input sparse matrix. We can construct the output by copying over the indptr and indices, and setting the data to an array of the same shape but with the binned dtype.
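
A quick sketch of that suggestion (assumed names), reusing the input's structure arrays instead of going through lil_matrix:

import numpy as np
import scipy.sparse as sp

def _binned_like(X_csr, binned_data, bin_dtype=np.uint8):
    # Sketch: same sparsity pattern as the input; only the data array changes
    # and gets the binned dtype, so no lil_matrix round-trip is needed.
    return sp.csr_matrix(
        (binned_data.astype(bin_dtype), X_csr.indices.copy(), X_csr.indptr.copy()),
        shape=X_csr.shape)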

@@ -17,6 +18,52 @@
from ._bitset import set_bitset_memoryview


def _pack_matrix(sparse_matrix):
Member

We can convert the input into a CSC matrix before feeding it into _BinMapper and then build the bins from there.

@NicolasHug
Member

For the tree building, I think we need to create separate objects for Histogram Building and the Splitter for handling sparse data. Luckily the histograms can still be dense, so the loss computation is okay.

I might have missed things in #16885, but I don't think we need to change the splitter since, as you said, the histograms will still be dense, and the splitter only needs the histograms. Also, I agree that the loss computation will be unchanged, but it's not related to histograms: the loss computation only needs the true and predicted target arrays.

@thomasjpfan
Member

I am thinking about how Splitter.split_indices needs to work with a sparse X_binned to do the actual splitting. Looking over the code, most of the Splitter is "split finding" and split_indices does the actual splitting. I think the actual splitting can be split into two classes: DenseSplitter and SparseSplitter. The split-finding parts of the current Splitter can become a SplitFinder.

@NicolasHug
Member

In split_indices, the only use of X_binned is for X_binned[sample_idx] (where X_binned is aliased to a column). It should work fine with sparse matrices as long as we use csc for X_binned as you suggested above?

@thomasjpfan
Member

It should work fine with sparse matrices as long as we use csc for X_binned as you suggested above?

Since we are at the Cython level with no GIL, a sparse X_binned is represented by three numpy arrays: data, indices, and indptr. The first slicing operation, self.X_binned[:, feature_idx], is a simple slice. When we need X_binned[sample_idx], we need a binary search over a slice of indices (assuming the indices are sorted).

So there could be an abstraction for X_binned, one backed by a dense array and another backed by a sparse matrix. One would need to be careful that this abstraction does not add any overhead for the dense case.
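
For concreteness, a plain-Python sketch of that lookup (assumed names; the real code would live in Cython):

import numpy as np

def csc_lookup(data, indices, indptr, sample_idx, feature_idx):
    # Sketch: restrict to the column's slice of the three CSC arrays, then
    # binary-search the (sorted) row indices; a miss means an implicit zero.
    start, stop = indptr[feature_idx], indptr[feature_idx + 1]
    pos = start + np.searchsorted(indices[start:stop], sample_idx)
    if pos < stop and indices[pos] == sample_idx:
        return data[pos]
    return 0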

Personman000 added a commit to Personman000/CSCD01-Backup that referenced this pull request Oct 13, 2021
Development

Successfully merging this pull request may close these issues.

Add Sparse Matrix Support For HistGradientBoostingClassifier
5 participants