SLEP022: Fixing randomness ambiguities #88
base: main
Conversation
I'd like to take on this SLEP if it is OK with you.

Sure, happy for you to take over @betatim, thanks!
Abstract
========
Mention that we make these changes in behaviour together with the switch to the new NumPy random infrastructure, combined with the plan to transition the argument name from `random_state=` to `rng=`.
if 'random_state' not in est.get_params():
    raise ValueError("This estimator isn't random and can only have exact clones")

def cross_val_score(est, X, y, cv, use_exact_clones=True):
Rename `use_exact_clones` so that statistical clones are allowed but it is fine to have deterministic estimators as well. Maybe `allow_statistical_clone=`?
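A hedged sketch of how the renamed parameter could look; `allow_statistical_clone` and `statistical_clone` are hypothetical names following the suggestion above, not part of the scikit-learn API:

from sklearn.base import clone

def statistical_clone(est):
    # Copy the estimator but give the copy fresh randomness.
    if 'random_state' not in est.get_params():
        raise ValueError("This estimator isn't random and can only have exact clones")
    return clone(est).set_params(random_state=None)

def cross_val_score(est, X, y, cv, allow_statistical_clone=False):
    scores = []
    for train, test in cv.split(X, y):
        if allow_statistical_clone and 'random_state' in est.get_params():
            fold_est = statistical_clone(est)  # each fold draws its own randomness
        else:
            fold_est = clone(est)  # exact clone; also fine for deterministic estimators
        scores.append(fold_est.fit(X[train], y[train]).score(X[test], y[test]))
    return scores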
Quick notes based on our drafting meeting discussion:
This means that ideally, calling `est.fit(X, y)` should yield the same model
twice. We have a check for that in the `check_estimator()` suite:
`check_fit_idempotent()`. Clearly, this fit-idempotency property is violated
when None is used.
Suggested change: replace "when None is used." with "when None or a `np.random.RandomState` instance is used."
Note: we could improve our docs and state that scikit-learn compatible estimators' fit results should be at least "statistically idempotent" (or "idempotent in expectation") unless `random_state` is passed as an integer seed, in which case they should be strictly idempotent.
I think this is a sufficient requirement.
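A minimal sketch (assuming NumPy and scikit-learn are installed) of the distinction drawn here: an integer seed gives strict fit-idempotency, while a shared `np.random.RandomState` instance is consumed by `fit` and is at best statistically idempotent:

import numpy as np
from sklearn.ensemble import RandomForestClassifier

X = np.random.RandomState(0).rand(200, 5)
y = (X[:, 0] > 0.5).astype(int)

est = RandomForestClassifier(n_estimators=10, random_state=42)
a = est.fit(X, y).feature_importances_
b = est.fit(X, y).feature_importances_
print(np.allclose(a, b))   # True: integer seed, strictly idempotent

rs = np.random.RandomState(42)
est = RandomForestClassifier(n_estimators=10, random_state=rs)
a = est.fit(X, y).feature_importances_
b = est.fit(X, y).feature_importances_
print(np.allclose(a, b))   # typically False: the instance is consumed across fits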
are using.

This also introduces a private attribute, so we would need more intrusive
changes to `set_params`, `get_params`, and `clone`.
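For illustration, a hedged sketch of the "seed in `__init__`, store a private attribute" alternative this passage refers to (the attribute name `_seed` is made up here); it is this extra attribute that would force `set_params`, `get_params`, and `clone` to become aware of it:

import numpy as np
from sklearn.utils import check_random_state

class SeedInInitEstimator:
    def __init__(self, random_state=None):
        self.random_state = random_state
        # Randomness is resolved once, at construction time, and stashed privately.
        self._seed = check_random_state(random_state).randint(np.iinfo(np.int32).max)

    def fit(self, X, y=None):
        rng = np.random.RandomState(self._seed)  # every fit replays the same seed
        self.offset_ = rng.uniform()
        return self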
I think we could alternatively explore the possibility to do stateless consumption in `fit` instead of seeding in `__init__`:

def fit(self):  # or split()
    rng = check_random_state(self.random_state, copy=True)
    ...

This way we would still enforce the stateless fit semantics for the `est.random_state` attribute.
I will try to find the time to explore this option in an alternative notebook in a gist.
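A hedged sketch of that idea; note that `copy=True` is not an existing `check_random_state` parameter, so a hypothetical helper is spelled out here instead:

import copy
import numpy as np
from sklearn.utils import check_random_state

def check_random_state_copy(random_state):
    # Hypothetical helper: like check_random_state, but never lets fit() mutate
    # a RandomState instance that the user passed in.
    if isinstance(random_state, np.random.RandomState):
        return copy.deepcopy(random_state)
    return check_random_state(random_state)

class StatelessFitEstimator:
    def __init__(self, random_state=None):
        self.random_state = random_state  # stored verbatim, never consumed directly

    def fit(self, X, y=None):
        rng = check_random_state_copy(self.random_state)
        self.offset_ = rng.uniform()  # all randomness comes from the local copy
        return self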
Thanks for working on this! I think these clarifications in the docs and the updated behavior would be really helpful to reduce ambiguity. I noticed that in scikit-learn/scikit-learn#26148 (comment) point 3 there was a mention that the new NumPy Generators would support a different behavior than the old RandomState objects (if I understood correctly). I didn't see a mention of that distinction in this proposal; is this difference still planned to be implemented?
This SLEP is NOT the same as its predecessor #24.
Disclaimer: I won't be able to champion this SLEP.
I'm opening it here now because I hope it can help better frame the discussions happening in scikit-learn/scikit-learn#26148.