-
-
Notifications
You must be signed in to change notification settings - Fork 25.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CLN Refactors find binning threshold #18395
CLN Refactors find binning threshold #18395
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
otherwise LGTM
@@ -16,76 +16,53 @@ | |||
from .common import X_DTYPE, X_BINNED_DTYPE, ALMOST_INF | |||
|
|||
|
|||
def _find_binning_thresholds(data, max_bins, subsample, random_state): | |||
def _find_binning_threshold(col_data, max_bins): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should still have an s
:
def _find_binning_threshold(col_data, max_bins): | |
def _find_binning_thresholds(col_data, max_bins): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Besides the following edge case that would need to be handled and tested, LGTM. Because it's unrelated to this PR, let's merge.
missing_mask = np.isnan(col_data) | ||
if missing_mask.any(): | ||
col_data = col_data[~missing_mask] | ||
col_data = np.ascontiguousarray(col_data, dtype=X_DTYPE) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we have a problem if a column has 100% missing values. But this problem is already present in master.
Reference Issues/PRs
Related to #16909 and #18394
What does this implement/fix? Explain your changes.
This PR is used to make the above PRs easier to review.
CC @NicolasHug