[WIP] Add sparse matrix support for histgradientboostingclassifier #19187
Conversation
# Since col_data is only used for getting unique values
# and computing percentiles, zeros don't matter
values = col_data[col_data.nonzero()[0]].toarray()
if 0 in col_data:
    values = np.append(values, 0)
col_data = values
Thanks for the PR @StealthyKamereon.
Do I understand correctly that this essentially removes all zeros but one within the column?
If that's the case, then the behaviour won't be equivalent to the dense case:
# dense case
col_data = np.random.randn(1000)
col_data[:900] = 0
max_bins = 30
percentiles = np.linspace(0, 100, num=max_bins + 1)
percentiles = percentiles[1:-1]
np.percentile(col_data, percentiles, interpolation='midpoint')
# gives
array([-0.32659394, 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0.5365524 ])
!=
# proposed sparse case
col_data2 = col_data[np.flatnonzero(col_data)]
col_data2 = np.append(col_data2, 0)
np.percentile(col_data2, percentiles, interpolation='midpoint')
# gives
array([-1.59708717, -1.28013648, -1.06639733, -0.93588368, -0.86844474,
-0.77952759, -0.67911702, -0.61731544, -0.5256898 , -0.32659394,
-0.19481156, -0.05218316, 0.00542425, 0.0535143 , 0.1421762 ,
0.24276456, 0.27940835, 0.36767042, 0.44821144, 0.5365524 ,
0.64450162, 0.67886618, 0.77848088, 0.87048614, 1.08061173,
1.25631194, 1.50380088, 1.60707389, 1.88223662])
col_data2 basically corresponds to what this PR does to the sparse case and it's very different from the above dense case.
In any case, it would be good to have a test ensuring the consistency of the output for the dense and sparse cases (with various max_bins, sparsity levels, and number of unique values)
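A rough sketch of the kind of consistency test suggested above, assuming the bin mapper ends up accepting scipy sparse input (the _BinMapper import path, its n_bins parameter and bin_thresholds_ attribute come from the existing dense code; the sparse fit path itself is the assumption here):

import numpy as np
import pytest
from scipy import sparse

from sklearn.ensemble._hist_gradient_boosting.binning import _BinMapper


@pytest.mark.parametrize("n_bins", [16, 64, 256])
@pytest.mark.parametrize("sparsity", [0.1, 0.5, 0.9])
def test_bin_mapper_dense_sparse_consistency(n_bins, sparsity):
    rng = np.random.RandomState(0)
    X = rng.randn(1000, 3)
    X[rng.rand(1000, 3) < sparsity] = 0.0  # introduce implicit zeros

    mapper_dense = _BinMapper(n_bins=n_bins).fit(X)
    mapper_sparse = _BinMapper(n_bins=n_bins).fit(sparse.csc_matrix(X))

    for thresholds_dense, thresholds_sparse in zip(
            mapper_dense.bin_thresholds_, mapper_sparse.bin_thresholds_):
        np.testing.assert_allclose(thresholds_dense, thresholds_sparse)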
You are right, I naively thought that the output would be the same but it's definitely not the case.
I added some tests as suggested, but concerning the computation, does that mean we just want to densify col_data?
Now that I think more about this, it might actually make sense to not have an equivalence in the bin frontiers computation between the sparse and the dense case.
If a column has tons of zeros, just computing the percentiles would lead to non-informative bin frontiers as most of them would be zeros (see my updated message above with the actual values).
OTOH if we first remove all zeros we'll have better bin frontiers... but we'd be missing 2 important ones: 0 - eps and 0 + eps.
I know that LightGBM manually forces these 2 bin frontiers to exist (whether the matrix is sparse or not BTW, but that's irrelevant).
Maybe a good strategy for the sparse case would be to do almost as you did but:
- compute percentiles on the densified non-zero values without re-inserting a 0, and ask for n - 2 midpoints instead of n
- manually insert 0 - eps and 0 + eps (a simple binary search would do here) so we end up with the required n midpoints
The equivalence between the sparse and dense case would not hold, but it actually looks more sensible this way. Thoughts @StealthyKamereon @ogrisel @lorentzenchr ?
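A rough numpy-only sketch of that strategy (illustrative, not code from this PR; the helper name and the n_midpoints parameter are made up for the example):

import numpy as np

def sparse_bin_thresholds(nonzero_values, n_midpoints, eps):
    # n_midpoints - 2 midpoints computed on the non-zero values only
    percentiles = np.linspace(0, 100, num=n_midpoints)[1:-1]
    midpoints = np.percentile(nonzero_values, percentiles,
                              interpolation='midpoint')
    # manually insert 0 - eps and 0 + eps at their sorted position
    # (np.searchsorted plays the role of the simple binary search)
    idx = np.searchsorted(midpoints, 0.0)
    return np.insert(midpoints, idx, [-eps, eps])  # n_midpoints thresholds in total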
I agree, this is not very visible on our dummy examples, but if we look at the sparsity of the matrix generated for #15336, we can imagine that all the useful information would be squeezed out.
What do you suggest for eps? Something like 1e-7 or bigger?
np.finfo(X_DTYPE).eps should be what we need here.
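For reference (assuming X_DTYPE is float64 here), that eps is much smaller than 1e-7:

import numpy as np
np.finfo(np.float64).eps  # 2.220446049250313e-16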
Then what about transform?
If we computed the bins ignoring all the implicit zeros, it feels wrong mapping those zeros to the actual_bin_zero 🤔
Hm indeed, transform should definitely not use the same code as for the dense case and indeed zeros should be mapped to bin 0. In this case we might not need to insert the 2 additional bin frontiers, and just let the sparse transform properly deal with the zeros? I guess the sparse transform code would look like:
if x is an implicit 0:
    map to bin 0
else:
    bin = use same binary search as in dense transform
    map to bin + 1
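An illustrative numpy version of that pseudocode for a single CSC column (the helper name is hypothetical, and np.searchsorted stands in for the Cython binary search used in the dense transform):

import numpy as np

def map_sparse_column_to_bins(col_csc, bin_thresholds):
    # non-zero values: same binary search as the dense case, then shift by 1
    binned_nonzero = np.searchsorted(bin_thresholds, col_csc.data) + 1
    # implicit zeros keep bin 0
    binned = np.zeros(col_csc.shape[0], dtype=np.uint8)
    binned[col_csc.indices] = binned_nonzero.astype(np.uint8)
    return binned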
I don't understand why we need the two bins. 🤷
If I sum up, we want col_data to ignore implicit zeros (do we want to ignore explicit ones too?) and have the same logic as the dense case from there.
For transform we don't want to map implicit zeros to actual_bin_zero, so we get rid of the shifting process.
Maybe we want to shift every bin by one so the implicit zeros don't over-populate bin 0?
I agree we don't need the 2 additional bin frontiers. We should just compute the bin frontiers on all non-zero values, and then pass these bin frontiers to a sparse transform. The sparse transform would be in charge of mapping the zeros to 0, and of mapping the rest of the values into [1, upper_bound] (where upper_bound is usually 255, or 254, I don't remember; 255 might be reserved for missing values). This mapping for the non-zero values can use the same binary search algorithm as for the dense case.
Okay, everything that we discussed is done.
As commented, I see the current bin mapper implementation as having significant memory overhead.
If we go down the native approach, then we need the tree building to operate on a csc matrix. The csc matrix would represent the binned data. For prediction, we would need to operate on csr matrices. These csr matrices would be unbinned.
For the sparse case, I think we need Exclusive Feature Bundling as described in section 4 of the LightGBM paper, which is enabled by default in LightGBM. I would be open to having this in a future PR.
I think the hard part would be to put sparse matrices into what we have now for tree building and prediction. I think prediction would be a little easier because, worst case, we can create a SparseTreePredictor.
For the tree building, I think we need to create separate objects for Histogram Building and the Splitter for handling sparse data. Luckily the histograms can still be dense, so the loss computation is okay.
Going back to the binner, I think it would be cleaner to have a sparse binner as well. I think mixing in too much dense code with sparse code makes things harder to maintain.
def _unpack_matrix(dense_matrix, row_indices, shape):
    sparse_matrix = sp.lil_matrix(shape, dtype=dense_matrix.dtype)
From my experience, using lil_matrix to construct medium-size matrices and then converting with tocsr is not great time-wise and memory-wise.
If we are building a binned matrix, we know for sure that it will have the same non-zero elements as the input sparse matrix. We can construct the output by copying over the indptr and indices, and setting the data to be the same shape but with the binned dtype.
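A small sketch of that idea, assuming CSC input (the helper name is illustrative, not part of this PR): the binned matrix reuses the input's indptr and indices, only the data changes.

import numpy as np
import scipy.sparse as sp

def make_binned_like(X_csc, binned_data, bin_dtype=np.uint8):
    # same sparsity structure as the input; data replaced by the binned values
    return sp.csc_matrix(
        (binned_data.astype(bin_dtype), X_csc.indices.copy(), X_csc.indptr.copy()),
        shape=X_csc.shape,
    )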
@@ -17,6 +18,52 @@
from ._bitset import set_bitset_memoryview


def _pack_matrix(sparse_matrix):
We can convert to a csc before feeding it into _BinMapper and then build the bins from there.
I might have missed things in #16885, but I don't think we need to change the splitter since, as you said, the histograms will still be dense, and the splitter only needs the histograms. Also I agree that the loss computation will be unchanged, but it's not related to histograms: the loss computation only needs the true and predicted target arrays.
Since we are at the Cython level with no GIL, a sparse […] So there could be an abstraction for […]
This pull request adds crude support for sparse matrices to HistGradientBoostingClassifier.
The handling of the sparse matrix is rather crude in order to keep the PR small and avoid touching the Cython code for now.
Reference Issues/PRs
Fixes #15336
What does this implement/fix? Explain your changes.
It implements sparse matrix support for HistGradientBoostingClassifier.
There are two major differences when handling a sparse matrix: during fit and during transform.
Fit
At fitting time, the sparse matrix can be safely converted to a dense one because only unique values are important.
This is done by getting the non-zero values and then appending 0 to the list of values.
The matrix can now be treated the same way a dense one is.
This avoids duplicating code and having to test a new algorithm.
Transform
Transform is trickier.
The main problem here (as @NicolasHug explained) is that we want the bin with value 0 to be mapped to the value 0.
That way we can take advantage of the sparse structure to treat missing values as 0.
In the returned matrix, values in the bins [0, actual_bin_zero) are right-shifted by one and the bin actual_bin_zero is mapped to 0.
The transform function returns actual_bin_zero alongside the matrix.
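A small illustration of that remapping with made-up values (here actual_bin_zero = 3):

import numpy as np

actual_bin_zero = 3
dense_bins = np.array([0, 1, 2, 3, 4, 5])
sparse_bins = np.where(dense_bins < actual_bin_zero, dense_bins + 1,
                       np.where(dense_bins == actual_bin_zero, 0, dense_bins))
# dense_bins  -> [0 1 2 3 4 5]
# sparse_bins -> [1 2 3 0 4 5]   the bin holding the zeros now maps to 0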
The second short-term problem is that the core of the binning logic is done in _map_to_bins, which is a Cython function.
I prefer not to touch the Cython code for now, so I worked around it.
The process is the following:
Packing / Unpacking
To allow _map_to_bins to "process" a sparse matrix, we pack it into a dense one.
This is done column-wise and the resulting matrix is the smallest possible that can fit the data.
The function returns the packed matrix and the row indices of the former values.
Unpacking the matrix is the exact opposite: we use the row indices to put the values back into the correct positions in the sparse matrix.
Example:
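A hypothetical sketch of the column-wise packing (not the actual _pack_matrix from this PR, just an illustration of the idea):

import numpy as np
import scipy.sparse as sp

def pack_columns(X_csc):
    # smallest dense matrix that can hold every column's non-zero values
    n_packed_rows = int(np.max(np.diff(X_csc.indptr)))
    packed = np.zeros((n_packed_rows, X_csc.shape[1]), dtype=X_csc.dtype)
    row_indices = []
    for j in range(X_csc.shape[1]):
        start, stop = X_csc.indptr[j], X_csc.indptr[j + 1]
        packed[:stop - start, j] = X_csc.data[start:stop]
        row_indices.append(X_csc.indices[start:stop])  # rows of the packed values
    return packed, row_indices

X = sp.csc_matrix(np.array([[0., 3., 0.],
                            [1., 0., 0.],
                            [0., 5., 0.],
                            [2., 0., 4.]]))
packed, rows = pack_columns(X)
# packed:
# [[1. 3. 4.]
#  [2. 5. 0.]]
# rows: [array([1, 3]), array([0, 2]), array([3])]
# Unpacking uses `rows` to scatter the (binned) values back into a sparse
# matrix with the original shape.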
Any other comments?
The main performance bottleneck is the packing / unpacking of the matrix.
I kept that fact in mind and always used the sparsity of the matrix to my advantage.
In the future, the Cython code needs to be reworked to make room for a cleaner flow.