
[WIP] Add sparse matrix support for HistGradientBoostingClassifier #19187

Open
wants to merge 4 commits into base: main

Conversation

StealthyKamereon

This pull request adds crude support for sparse matrices to HistGradientBoostingClassifier.
The handling of sparse matrices is deliberately crude to keep the PR small and avoid touching the Cython code for now.

Reference Issues/PRs

Fixes #15336

What does this implement/fix? Explain your changes.

It implements sparse matrix support for HistGradientBoostingClassifier.
There are two major differences when handling a sparse matrix: during fit and during transform.

Fit

At fitting time, the sparse matrix can be safely converted to a dense one because only the unique values matter.
This is done by collecting the non-zero values and then appending a single 0 to the list of values.
The matrix can then be treated the same way a dense one is.

That way, it avoids duplicating the code and having to test a new algorithm.
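
For illustration, a minimal sketch of this densification on a single column (hypothetical variable names; the PR's actual diff is quoted later in the review):

import numpy as np
from scipy import sparse

# Sketch only: collect the stored (non-zero) entries of one column and append
# a single 0 so the thresholds account for the zeros without materializing them.
col = sparse.csr_matrix([[0.0], [3.0], [0.0], [1.0]])
values = col[col.nonzero()[0]].toarray().ravel()  # array([3., 1.])
if col.nnz < col.shape[0]:                        # the column contains zeros
    values = np.append(values, 0)                 # array([3., 1., 0.])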

Transform

Transform is trickier.
The main problem here (as @NicolasHug explained) is that we want the bin with value 0 to be mapped to the value 0.
That way we can take advantage of the sparse structure to treat missing values as 0.
In the returned matrix, values in the bins [0, actual_bin_zero) are right-shifted by one and the bin actual_bin_zero is mapped to 0.
The transform function returns actual_bin_zero alongside the matrix.
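
To make the shifting concrete, here is a small sketch (hypothetical helper name, dense array for readability):

import numpy as np

def _shift_bins(binned, actual_bin_zero):
    # Sketch: bins below actual_bin_zero move up by one and actual_bin_zero
    # itself becomes 0, so implicit zeros can stay implicit in the output.
    shifted = binned.copy()
    shifted[binned < actual_bin_zero] += 1
    shifted[binned == actual_bin_zero] = 0
    return shifted

# e.g. _shift_bins(np.array([0, 1, 2, 3, 4]), actual_bin_zero=3)
# -> array([1, 2, 3, 0, 4])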

The second short-term problem is that the core of the binning logic is done in _map_to_bins, which is a Cython function.
I prefer not to touch the Cython code for now, so I worked around it.

The process is the following:

  1. We pass a 1x1 zero-filled matrix to _map_to_bins to determine actual_bin_zero.
  2. We pack the sparse matrix into a dense one (process described in detail below).
  3. We pass the packed matrix to _map_to_bins to map the bins.
  4. We shift the mapped bins.
  5. We unpack the dense matrix into a sparse one, preserving the original structure.
Packing / Unpacking

To allow _map_to_bins to "process" a sparse matrix, we pack it into a dense one.
This is done column-wise, and the resulting matrix is the smallest one that can fit the data.
The function returns the packed matrix and the row indices of the original values.

Unpacking the matrix is the exact opposite: we use the row indices to put the values back into their original positions in the sparse matrix.

Example:

>>> a = csr_matrix(array(
      [[   1, 1000,    0],
       [   0,  999,   -1],
       [   2,    0,   -2],
       [   3,    0,   -3],
       [   0,    0,   -4]]))
>>> _pack_matrix(a)
(
array([[   1, 1000,   -1],
       [   2,  999,   -2],
       [   3,    0,   -3],
       [   0,    0,   -4]]), 
array([[ 0.,  0.,  1.],
       [ 2.,  1.,  2.],
       [ 3., nan,  3.],
       [nan, nan,  4.]])
)
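
Consistent with this example, here is a rough sketch of how such a packing could be implemented (illustrative only, not the PR's exact code):

import numpy as np
import scipy.sparse as sp

def _pack_matrix(sparse_matrix):
    # Sketch: stack each column's non-zero values at the top of a dense matrix
    # whose height is the largest per-column non-zero count; row_indices keeps
    # the original row of every packed value, with NaN marking the padding.
    X = sp.csc_matrix(sparse_matrix)
    X.sort_indices()
    n_cols = X.shape[1]
    height = np.diff(X.indptr).max()
    packed = np.zeros((height, n_cols), dtype=X.dtype)
    row_indices = np.full((height, n_cols), np.nan)
    for j in range(n_cols):
        start, stop = X.indptr[j], X.indptr[j + 1]
        packed[:stop - start, j] = X.data[start:stop]
        row_indices[:stop - start, j] = X.indices[start:stop]
    return packed, row_indices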

Any other comments?

The main performance bottleneck is the packing / unpacking of the matrix.
I kept that in mind and always used the sparsity of the matrix to my advantage.

In the future, the Cython code will need to be reworked to make room for a cleaner flow.

Comment on lines 90 to 95
# Since col_data is only used to get unique values
# and compute percentiles, zeros don't matter
values = col_data[col_data.nonzero()[0]].toarray()
if 0 in col_data:
    values = np.append(values, 0)
col_data = values
Member
@NicolasHug NicolasHug Jan 17, 2021

Thanks for the PR, @StealthyKamereon.

Do I understand correctly that this essentially removes all zeros but one within the column?

If that's the case, then the behaviour won't be equivalent to the dense case:

# dense case
import numpy as np

col_data = np.random.randn(1000)
col_data[:900] = 0

max_bins = 30
percentiles = np.linspace(0, 100, num=max_bins + 1)
percentiles = percentiles[1:-1]

np.percentile(col_data, percentiles, interpolation='midpoint')
# gives
array([-0.32659394,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.5365524 ])

!=

# proposed sparse case
col_data2 = col_data[np.flatnonzero(col_data)]
col_data2 = np.append(col_data2, 0)

np.percentile(col_data2, percentiles, interpolation='midpoint')
# gives
array([-1.59708717, -1.28013648, -1.06639733, -0.93588368, -0.86844474,
       -0.77952759, -0.67911702, -0.61731544, -0.5256898 , -0.32659394,
       -0.19481156, -0.05218316,  0.00542425,  0.0535143 ,  0.1421762 ,
        0.24276456,  0.27940835,  0.36767042,  0.44821144,  0.5365524 ,
        0.64450162,  0.67886618,  0.77848088,  0.87048614,  1.08061173,
        1.25631194,  1.50380088,  1.60707389,  1.88223662])

col_data2 basically corresponds to what this PR does to the sparse case and it's very different from the above dense case.

In any case, it would be good to have a test ensuring the consistency of the output for the dense and sparse cases (with various max_bins, sparsity levels, and number of unique values)

Author

You are right, I naively thought that the output would be the same, but it's definitely not the case.
I added some tests as suggested, but concerning the computation, does that mean we just want to densify col_data?

Member

Now that I think more about this, it might actually make sense to not have an equivalence in the bin frontiers computation between the sparse and the dense case.

If a column has tons of zeros, just computing the percentiles would lead to non-informative bin frontiers as most of them would be zeros (see my updated message above with the actual values).

OTOH if we first remove all zeros we'll have better bin frontiers... but we'd be missing 2 important ones: 0 - eps and 0 + eps.

I know that LightGBM manually forces these 2 bin frontiers to exist (whether the matrix is sparse or not BTW, but that's irrelevant).

Maybe a good strategy for the sparse case would be to do almost as you did, but:

  • compute percentiles on the densified non-zero values without re-inserting a 0, and ask for n - 2 midpoints instead of n
  • manually insert 0 - eps and 0 + eps (a simple binary search would do here) so we end up with the required n midpoints.

The equivalence between the sparse and dense cases would not hold, but it actually looks more sensible this way. Thoughts @StealthyKamereon @ogrisel @lorentzenchr?
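
A rough sketch of that strategy (assumed helper name and signature, not actual scikit-learn code) could look like:

import numpy as np

def sparse_bin_thresholds(nonzero_values, max_bins, dtype=np.float64):
    # Sketch: ask for (max_bins - 1) - 2 midpoints on the non-zero values only,
    # then force the two frontiers around zero so zeros get their own bin.
    eps = np.finfo(dtype).eps
    percentiles = np.linspace(0, 100, num=max_bins - 1)[1:-1]
    midpoints = np.percentile(nonzero_values, percentiles,
                              interpolation='midpoint')
    # a simple binary search keeps the frontiers sorted after insertion
    positions = np.searchsorted(midpoints, [-eps, eps])
    return np.insert(midpoints, positions, [-eps, eps])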

Author
@StealthyKamereon StealthyKamereon Jan 17, 2021

I agree; this is not very visible in our dummy examples, but if we look at the sparsity of the matrix generated for #15336, we can imagine that all the useful information would be squeezed out.
What do you suggest for eps? Something like 1e-7, or bigger?

Member

np.finfo(X_DTYPE).eps should be what we need here

Author

Then what about transform?
If we compute the bins ignoring all the implicit zeros, it feels wrong to map those zeros to actual_bin_zero 🤔

Member

Hm indeed, transform should definitely not use the same code as for the dense case, and indeed zeros should be mapped to bin 0. In this case we might not need to insert the 2 additional bin frontiers, and just let the sparse transform properly deal with the zeros? I guess the sparse transform code would look like:

if x is an implicit 0:
	map to bin 0
else:
	bin = use same binary search as in dense transform
	map to bin + 1
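
For illustration only, a sketch of that pseudocode on one CSC column (assumed helper name and thresholds argument):

import numpy as np
import scipy.sparse as sp

def _transform_sparse_column(col_csc, binning_thresholds):
    # Sketch: implicit zeros are simply not stored, so they stay at bin 0,
    # while every stored value gets a binary-search bin shifted up by one.
    binned = col_csc.copy()
    binned.data = np.searchsorted(binning_thresholds, col_csc.data,
                                  side='right') + 1
    return binned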

Author

I don't understand why we need the two bins. 🤷
To sum up, we want col_data to ignore implicit zeros (do we want to ignore explicit ones too?) and follow the same logic as the dense case from there.
For transform we don't want to map implicit zeros to actual_bin_zero, so we get rid of the shifting process.
Maybe we want to shift every bin by one so the implicit zeros don't over-populate bin 0?

Member

I agree we don't need the 2 additional bin frontiers. We should just compute the bin frontiers on all non-zero values, and then pass these bin frontiers to a sparse transform. The sparse transform would be in charge of mapping the zeros to 0, and mapping the rest of the values into [1, upper_bound] (where upper_bound is usually 255 (or 254, I don't remember; 255 might be reserved for missing values)). This mapping for the non-zero values can use the same binary search algorithm as for the dense case.

@StealthyKamereon
Author

Okay, everything we discussed is done.
I deliberately didn't add much documentation for now, as the content of this PR can change greatly depending on your remarks.
I'll add it at the end, or once some functionality has stabilized.

Base automatically changed from master to main January 22, 2021 10:53
Member
@thomasjpfan thomasjpfan left a comment

As commented, I see the current bin mapper implementation having significant memory overhead.

If we go down the native approach, then we need the tree building to operate on a csc matrix. The csc matrix would represent the binned data. For prediction, we would need to operate on csr matrices. These csr matrices would be unbinned.

For the sparse case, I think we need Exclusive Feature Bundling as described in section 4 of the LightGBM paper which is enabled by default in LightGBM. I would be open to having this in a future PR.

I think the hard part would be to put sparse matrices into what we have now for tree building and prediction. I think prediction would be a little easier because worst case, we can create a SparseTreePredictor.

For the tree building, I think we need to create separate objects for Histogram Building and the Splitter for handling sparse data. Luckily the histograms can still be dense, so the loss computation is okay.

Going back to the binner, I think it would be cleaner to have a sparse binner as well. I think mixing in too much dense code with sparse code makes things harder to maintain.



def _unpack_matrix(dense_matrix, row_indices, shape):
    sparse_matrix = sp.lil_matrix(shape, dtype=dense_matrix.dtype)
Member

From my experience, using lil_matrix to construct medium-sized matrices and then converting with tocsr is not great time-wise or memory-wise.

If we are building a binned matrix, we know for sure that it will have the same non-zero elements as the input sparse matrix. We can construct the output by copying over the indptr and indices, and setting the data to an array of the same shape but with the binned dtype.
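
A quick sketch of that suggestion (assumed names), reusing the input's structure arrays instead of going through lil_matrix:

import numpy as np
import scipy.sparse as sp

def _binned_like(X_csr, binned_data, bin_dtype=np.uint8):
    # Sketch: same sparsity pattern as the input; only the data array changes
    # and gets the binned dtype, so no lil_matrix round-trip is needed.
    return sp.csr_matrix(
        (binned_data.astype(bin_dtype), X_csr.indices.copy(), X_csr.indptr.copy()),
        shape=X_csr.shape)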

@@ -17,6 +18,52 @@
from ._bitset import set_bitset_memoryview


def _pack_matrix(sparse_matrix):
Member

We can convert the input into a CSC matrix before feeding it into _BinMapper and then build the bins from there.

@NicolasHug
Member

For the tree building, I think we need to create separate objects for Histogram Building and the Splitter for handling sparse data. Luckily the histograms can still be dense, so the loss computation is okay.

I might have missed things in #16885, but I don't think we need to change the splitter since, as you said, the histograms will still be dense, and the splitter only needs the histograms. Also, I agree that the loss computation will be unchanged, but it's not related to histograms: the loss computation only needs the true and predicted target arrays.

@thomasjpfan
Member

I am thinking about how Splitter.split_indices needs to work with a sparse X_binned to do the actual splitting. Looking over the code, most of the Splitter is "split finding" and split_indices does the actual splitting. I think the actual splitting can be split into two classes: DenseSplitter and SparseSplitter. The split-finding parts of the current Splitter can become a SplitFinder.

@NicolasHug
Member

In split_indices, the only use of X_binned is for X_binned[sample_idx] (where X_binned is aliased to a column). It should work fine with sparse matrices as long as we use csc for X_binned as you suggested above?

@thomasjpfan
Member

It should work fine with sparse matrices as long as we use csc for X_binned as you suggested above?

Since we are at the Cython level with no GIL, a sparse X_binned is represented by three numpy arrays: data, indices, and indptr. The first slicing operation, self.X_binned[:, feature_idx], is a simple slice. When we need X_binned[sample_idx], we need a binary search over a slice of indices (assuming the indices are sorted).

So there could be an abstraction for X_binned, one backed by a dense array and another backed by a sparse matrix. One would need to be careful that this abstraction does not add any overhead for the dense case.
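
For concreteness, a plain-Python sketch of that lookup (assumed names; the real code would live in Cython):

import numpy as np

def csc_lookup(data, indices, indptr, sample_idx, feature_idx):
    # Sketch: restrict to the column's slice of the three CSC arrays, then
    # binary-search the (sorted) row indices; a miss means an implicit zero.
    start, stop = indptr[feature_idx], indptr[feature_idx + 1]
    pos = start + np.searchsorted(indices[start:stop], sample_idx)
    if pos < stop and indices[pos] == sample_idx:
        return data[pos]
    return 0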

Personman000 added a commit to Personman000/CSCD01-Backup that referenced this pull request Oct 13, 2021
Development

Successfully merging this pull request may close these issues.

Add Sparse Matrix Support For HistGradientBoostingClassifier
5 participants