MAINT Refactor Splitter into a BaseSplitter #25101
base: main
Conversation
Co-authored-by: Adam Li <[email protected]>
Co-authored-by: Julien Jerphanion <[email protected]>
Co-authored-by: Julien Jerphanion <[email protected]>
Co-authored-by: Julien Jerphanion <[email protected]>
@adam2392 this builds and passes all the tree tests from pytest, though we get some errors with k-means which I'm still trying to troubleshoot.
What is the error from kmeans?
Turns out if you accidentally put one extra
@adam2392 The rest of the log is in [this doc].

sklearn/cluster/tests/test_k_means.py:390: AssertionError
_____________ test_kmeans_verbose[0-elkan] _____________
algorithm = 'elkan', tol = 0
capsys = <_pytest.capture.CaptureFixture object at 0x7fe92c5295e0>
@pytest.mark.parametrize("algorithm", ["lloyd", "elkan"])
@pytest.mark.parametrize("tol", [1e-2, 0])
def test_kmeans_verbose(algorithm, tol, capsys):
# Check verbose mode of KMeans for better coverage.
X = np.random.RandomState(0).normal(size=(5000, 10))
KMeans(
algorithm=algorithm,
n_clusters=n_clusters,
random_state=42,
init="random",
n_init=1,
tol=tol,
verbose=1,
).fit(X)
captured = capsys.readouterr()
assert re.search(r"Initialization complete", captured.out)
assert re.search(r"Iteration [0-9]+, inertia", captured.out)
if tol == 0:
> assert re.search(r"strict convergence", captured.out)
E AssertionError: assert None
E + where None = <function search at 0x7fe9c8b85160>('strict convergence', 'Initialization complete\nIteration 0, inertia 64759.062154957086\nIteration 1, inertia 46751.170526691916\nIteration ... 297, inertia 45634.20523994775\nIteration 298, inertia 45040.45016538214\nIteration 299, inertia 47499.921554245804\n')
E + where <function search at 0x7fe9c8b85160> = re.search
E + and 'Initialization complete\nIteration 0, inertia 64759.062154957086\nIteration 1, inertia 46751.170526691916\nIteration ... 297, inertia 45634.20523994775\nIteration 298, inertia 45040.45016538214\nIteration 299, inertia 47499.921554245804\n' = CaptureResult(out='Initialization complete\nIteration 0, inertia 64759.062154957086\nIteration 1, inertia 46751.170526...rtia 45634.20523994775\nIteration 298, inertia 45040.45016538214\nIteration 299, inertia 47499.921554245804\n', err='').out
sklearn/cluster/tests/test_k_means.py:390: AssertionError
__________________ test_warm_start[1] __________________
seed = 1
    @pytest.mark.filterwarnings("ignore:.*did not converge.*")
    @pytest.mark.parametrize("seed", (0, 1, 2))
    def test_warm_start(seed):
        random_state = seed
        rng = np.random.RandomState(random_state)
        n_samples, n_features, n_components = 500, 2, 2
        X = rng.rand(n_samples, n_features)
        # Assert the warm_start give the same result for the same number of iter
        g = GaussianMixture(
            n_components=n_components,
            n_init=1,
            max_iter=2,
            reg_covar=0,
            random_state=random_state,
            warm_start=False,
        )
        h = GaussianMixture(
            n_components=n_components,
            n_init=1,
            max_iter=1,
            reg_covar=0,
            random_state=random_state,
            warm_start=True,
        )
        g.fit(X)
        score1 = h.fit(X).score(X)
        score2 = h.fit(X).score(X)
>       assert_almost_equal(g.weights_, h.weights_)
E AssertionError:
E Arrays are not almost equal to 7 decimals
E
E Mismatched elements: 2 / 2 (100%)
E Max absolute difference: 0.01287114
E Max relative difference: 0.0258317
E x: array([0.5111401, 0.4888599])
E y: array([0.498269, 0.501731])
sklearn/mixture/tests/test_gaussian_mixture.py:841: AssertionError
= 15 failed, 24121 passed, 3119 skipped, 79 xfailed, 43 xpassed, 2437 warnings in 511.97s (0:08:31) =
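As a side note, the two failures quoted in the log can be re-run in isolation with a small pytest driver such as the sketch below. This is not from the thread: the test node IDs are simply taken from the log above, and an in-place build of the branch in a scikit-learn checkout is assumed.

```python
# Hypothetical reproduction script (not from the PR): rerun only the two
# failing tests quoted in the log against the current checkout.
import pytest

if __name__ == "__main__":
    raise SystemExit(
        pytest.main(
            [
                "sklearn/cluster/tests/test_k_means.py::test_kmeans_verbose",
                "sklearn/mixture/tests/test_gaussian_mixture.py::test_warm_start",
                "-v",
            ]
        )
    )
```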
Seems unrelated. Did you run make clean and then rebuild?
@adam2392 I built from scratch (i.e., creating a new conda env) when I benchmarked, but for these I just worked in my existing env (i.e., without creating a new env every time I build). So I just created a new env to check, and it seems like it's throwing the same error 😞 I'm not sure if it's relevant, but I forked from your
Update - I tested by bringing the branch up to date with
Sure. It could've been resolved on main. It looks like there's something going on, because most of the CI checks pass.
@adam2392 I just ran the same tests on my other computer, and it's able to pass all tests without any errors (both the current branch and
TL;DR - it's fixed!
= 24136 passed, 3119 skipped, 79 xfailed, 43 xpassed, 2434 warnings in 524.07s (0:08:44) =
@jshinm please update the PR description to summarize the changes, and also add the results of running the asv benchmarks to it.
@jjerphan this looks ready for an initial look if you're available. Regarding the LOC diff, the Criterion PR needs to be merged first for this PR to make sense.
Some minor docstring changes.
Co-authored-by: Adam Li <[email protected]>
Happy New Year, @adam2392 and @jshinm. Thank you for reporting the benchmark results. I would wait for #24678 to be merged before pursuing this PR, so as to not go back and forth between changes. If you want, in the meantime you can:
Signed-off-by: Adam Li <[email protected]>
Signed-off-by: Adam Li <[email protected]>
Signed-off-by: Adam Li <[email protected]>
I've been doing some testing downstream to see how trees can be refactored. I realized that, due to the inherent linkage between Splitter and Criterion, the current code would not work in practice. I will add it here for reference. See: cython/cython#5226 (comment). I think the issue is masked for some reason here, possibly because TreeBuilder still assumes a
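To make the coupling described above concrete, here is a minimal Python sketch of the data flow it refers to; the real classes live in Cython (sklearn/tree/_splitter.pyx, _criterion.pyx, _tree.pyx), and the signatures below are simplified assumptions rather than the actual API.

```python
# Simplified sketch (assumed names and signatures) of the Splitter/Criterion
# linkage: the tree builder hands y to the splitter, which exists mainly to
# feed its criterion, so a supervised signature leaks through every layer.

class Criterion:
    def init(self, y, sample_weight, samples, start, end):
        """Compute impurity statistics for samples[start:end] given targets y."""

class Splitter:
    def __init__(self, criterion):
        self.criterion = criterion

    def init(self, X, y, sample_weight):
        # The splitter only needs X to propose splits, but it must still
        # accept y because the criterion it owns requires it.
        self.X, self.y, self.sample_weight = X, y, sample_weight

    def node_reset(self, start, end, samples):
        self.criterion.init(self.y, self.sample_weight, samples, start, end)

class TreeBuilder:
    def build(self, splitter, X, y, sample_weight=None):
        # The builder assumes a supervised splitter signature, which can mask
        # problems when Splitter and Criterion are decoupled downstream.
        splitter.init(X, y, sample_weight)
```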
Thanks for the pointers. I will come back to this soon.
Reference Issues/PRs
Fixes #24990
What does this implement/fix? Explain your changes.
Adds the BaseSplitter class as an abstract class to be inherited by the Splitter class, which is now modularized without assuming a "supervised learning" setting. To achieve this, moderate refactoring was done on Criterion: the resetting of the index pointers was separated from the initialization process by creating a set_sample_pointers method that is called by the child classes of Criterion. This change is backwards compatible.
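A rough sketch of the shape this gives the classes, in plain Python rather than the actual Cython; everything except the names BaseSplitter, Splitter, Criterion, and set_sample_pointers (e.g. the signatures and the ClassificationCriterion example) is an illustrative assumption, not the code in this PR.

```python
class BaseSplitter:
    """Abstract splitter: partition/sample bookkeeping only, with no
    assumption of a supervised-learning setting (no y)."""

    def node_reset(self, start, end):
        # Track which slice of the sample index array the current node covers.
        self.start, self.end = start, end


class Splitter(BaseSplitter):
    """Supervised splitter: layers y and Criterion handling on top of BaseSplitter."""

    def __init__(self, criterion):
        self.criterion = criterion


class Criterion:
    def set_sample_pointers(self, start, end):
        # Resetting of the index pointers, split out of the former
        # initialization so that pointer bookkeeping is a separate step.
        self.start, self.end = start, end


class ClassificationCriterion(Criterion):
    """Example child class: it calls set_sample_pointers itself, keeping the
    overall behaviour backwards compatible."""

    def init(self, y, sample_weight, samples, start, end):
        self.set_sample_pointers(start, end)
        self.y, self.sample_weight, self.samples = y, sample_weight, samples
```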
ASV Benchmark result
Test Machine Spec
Any other comments?