DOC Improve and make consistent scoring parameter docstrings #30319
Conversation
sklearn/linear_model/_ridge.py
Outdated
- `None`: negative :ref:`mean squared error <mean_squared_error>` if cv is
  None (i.e. when using leave-one-out cross-validation), and
  :ref:`coefficient of determination <r2_score>` (:math:`R^2`) otherwise.
@StefanieSenger I get what you meant when you said Ridge is confusing. `cv` can no longer be 'auto' (I assume that's been deprecated) so removing that helps.
Essentially when using `cv=None`, we use the default efficient LOO cv, in which case negative mean squared error is used. R2 is used for all other values of `cv`. Now that I've figured out what this means, it makes sense as it is and I can't think of a better way to put it without making it too long. Suggestions welcome.
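To make that concrete, here is a minimal illustrative sketch (not part of the PR; the data and alpha grid are made up) contrasting the two code paths:

from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV

X, y = make_regression(n_samples=50, n_features=5, random_state=0)
alphas = [0.1, 1.0, 10.0]

# cv=None: efficient leave-one-out CV; alphas are compared by negative mean squared error
loo_ridge = RidgeCV(alphas=alphas, cv=None).fit(X, y)

# any other cv: ordinary cross-validation; alphas are compared by R^2
kfold_ridge = RidgeCV(alphas=alphas, cv=5).fit(X, y)

print(loo_ridge.alpha_, kfold_ridge.alpha_)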
That reads much clearer. :)
I am wondering, since there is no deprecation notification on `cv` for 'auto', whether the `gcv_mode` is meant with it. In this case it would have to be mentioned here.
Edit: just checked: the reference to 'auto' first appeared in the 0.21 release (link), but `cv` didn't have the option 'auto' then.
Yeah, no idea where 'auto' came from.
if the gcv_mode is meant with it.
I don't think so, because I think mean squared error is the default no matter what the gcv mode is:
scikit-learn/sklearn/linear_model/_ridge.py
Lines 2174 to 2175 in caaa1f5
squared_errors = (c / G_inverse_diag) ** 2
alpha_score = self._score_without_scorer(squared_errors=squared_errors)
Also, 'auto' could be either 'svd' or 'eigen', so it doesn't make sense to say the score is only a specific thing when `gcv_mode` is 'auto'.
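A small sketch of that point (purely illustrative, on a made-up dataset): `gcv_mode` only picks the decomposition used for the efficient LOO CV, while alpha selection is still based on the squared errors either way.

from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV

X, y = make_regression(n_samples=100, n_features=10, random_state=0)

# 'svd' and 'eigen' are just two ways of computing the same LOO residuals,
# so both are expected to select the same alpha on this toy data
svd_ridge = RidgeCV(alphas=[0.1, 1.0, 10.0], gcv_mode="svd").fit(X, y)
eigen_ridge = RidgeCV(alphas=[0.1, 1.0, 10.0], gcv_mode="eigen").fit(X, y)

print(svd_ridge.alpha_, eigen_ridge.alpha_)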
sklearn/feature_selection/_rfe.py
Outdated
@@ -553,7 +553,7 @@ class RFECV(RFE):

     The number of features selected is tuned automatically by fitting an :class:`RFE`
     selector on the different cross-validation splits (provided by the `cv` parameter).
-    The performance of the :class:`RFE` selector are evaluated using `scorer` for
+    The performance of the :class:`RFE` selectors are evaluated using `scoring` for
There is a selector for each cv fold, so I think plural is right here.
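For illustration (made-up dataset, not from the PR): one RFE selector is fit and scored per CV split, and the per-split scores are then aggregated per candidate number of features:

from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=20, n_informative=5, random_state=0)

selector = RFECV(LogisticRegression(max_iter=1000), cv=5, scoring="accuracy").fit(X, y)

# one mean score (aggregated over the 5 per-fold selectors) per candidate
# number of selected features
print(selector.cv_results_["mean_test_score"])
print(selector.n_features_)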
sklearn/base.py
Outdated
@@ -545,7 +545,7 @@ def __sklearn_tags__(self):

     def score(self, X, y, sample_weight=None):
         """
-        Return the mean accuracy on the given test data and labels.
+        Return mean :ref:`accuracy <accuracy_score>` on test data and labels.
Thinking again about the use of the word 'mean', maybe it is used to indicate it is not the count (which you get with `normalize=False`) but the fraction/'mean' accuracy across the test set?
Following on from our conversation in #30316 (comment), I do still agree that 'mean' is confusing. Not sure what a better solution would be though, 'fraction accuracy'?
Yes, it seems so to me as well.
But accuracy is defined (acc = (TP+TN)/all) as a normalized fraction anyway, and I think we don't need to provide the implementation details to the users here.
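For reference, a tiny sketch of the distinction being discussed (the labels are made up):

from sklearn.metrics import accuracy_score

y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]

print(accuracy_score(y_true, y_pred))                   # 0.8 -> fraction / 'mean' accuracy
print(accuracy_score(y_true, y_pred, normalize=False))  # 4   -> raw count of correct predictions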
I've removed 'mean', but happy to change
This is ready for review now that #30316 is merged; maybe @StefanieSenger and @ArturoAmorQ may be interested, if you are not tired of scoring! Thank you
Thank you for the PR, @lucyleeow. It will help our users find out about the options for scoring.
I left some suggestions.
  ``scorer(estimator, X, y)``. See :ref:`scoring_callable` for details.
- `None`: the :ref:`coefficient of determination <r2_score>`
  (:math:`R^2`) is used.
- 'loss': early stopping is checked w.r.t the loss value.
- 'loss': early stopping is checked w.r.t the loss value.
- 'loss': efficiently compute score with respect to the loss value
  at each iteration.
Maybe like this?
Hmm, maybe I don't understand this parameter. I thought this was the metric used to see if we should stop early, not a score to return...? In which case 'compute score' seems misleading?
I'm not sure, but what I was thinking was that the `scoring` to be used for the predictions is computed from the loss somehow during fit time, for efficiency.
But true: the code does early stopping based on the scores archived, and it doesn't seem to have anything to do with predict(). I am more confused about this than before. Please, feel free to disregard my suggestion.
I am more confused about this than before.
The meat of the early stopping code seems to be here:
scikit-learn/sklearn/ensemble/_hist_gradient_boosting/gradient_boosting.py
Lines 730 to 756 in ebfda90
if self.scoring == "loss":
    # we're going to compute scoring w.r.t the loss. As losses
    # take raw predictions as input (unlike the scorers), we
    # can optimize a bit and avoid repeating computing the
    # predictions of the previous trees. We'll reuse
    # raw_predictions (as it's needed for training anyway) for
    # evaluating the training loss.
    self._check_early_stopping_loss(
        raw_predictions=raw_predictions,
        y_train=y_train,
        sample_weight_train=sample_weight_train,
        raw_predictions_val=raw_predictions_val,
        y_val=y_val,
        sample_weight_val=sample_weight_val,
        n_threads=n_threads,
    )
else:
    self._scorer = check_scoring(self, self.scoring)
    # _scorer is a callable with signature (est, X, y) and
    # calls est.predict() or est.predict_proba() depending on
    # its nature.
    # Unfortunately, each call to _scorer() will compute
    # the predictions of all the trees. So we use a subset of
    # the training set to compute train scores.
    # Compute the subsample set
AFAICT, we want to check if there has been improvement in the score of the predictions, and if not we stop early (before we reach max leaves or something). For 'loss' we just use the loss, otherwise a scorer function is used. Both calculate on predictions, but scorers have the signature `(est, X, y)`, so we do it on a subset of train to avoid calling predict constantly.
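A hedged sketch of that reading (toy data, parameter values chosen only for illustration):

from sklearn.datasets import make_regression
from sklearn.ensemble import HistGradientBoostingRegressor

X, y = make_regression(n_samples=500, random_state=0)

# scoring='loss': early stopping monitors the loss directly, reusing the raw
# predictions that are already computed during fitting
est_loss = HistGradientBoostingRegressor(
    early_stopping=True, scoring="loss", n_iter_no_change=5, random_state=0
).fit(X, y)

# any other scoring: a scorer with signature (est, X, y) is built, so predictions
# are recomputed (on a subsample of the training set) at each check
est_r2 = HistGradientBoostingRegressor(
    early_stopping=True, scoring="r2", n_iter_no_change=5, random_state=0
).fit(X, y)

print(est_loss.n_iter_, est_r2.n_iter_)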
Yes, after looking at it again I would agree.
The `scoring` param here is used differently than in the CV models or in cross validation: it is only for determining how we calculate early stopping (if it is enabled). It's the parameter name that is confusing. But this cannot be changed so easily, so we need to live with it.
Going back to the documentation I see that it doesn't promise anything more than that the scoring is used in early stopping: "Scoring parameter to use for early stopping."
Still, I wonder if this misunderstanding could be prevented for other people by rewording into "Scoring to use for calculating early stopping."
Or alternatively add as a last sentence that this `scoring` param differs from the way scoring is defined in the Glossary. But it's also good as it is.
I see your point. I've altered it slightly to "Scoring method to use ..." because the 'parameter' part seemed redundant and possibly confusing.
I think it is clear enough in the parameter docstring that this is used only for early stopping. What we could do though is add an extra sentence in the glossary about how `scoring` is sometimes used to specify the scoring method to use for early stopping as well?
What we could do though is add an extra sentence in the glossary about how scoring is sometimes used to specify scoring method to use for early stopping as well?
Oh yes, that's good. 👍
sklearn/feature_selection/_rfe.py
Outdated
The performance of the :class:`RFE` selectors are evaluated using `scoring` for
different number of selected features and aggregated together. Finally, the scores
The performance of the :class:`RFE` selectors are evaluated using `scoring` for
different number of selected features and aggregated together. Finally, the scores
The performance of the :class:`RFE` selectors is evaluated using `scoring` for
different numbers of selected features and aggregated together. Finally, the scores
"is" sounds wrong, what about just:
The performance of the :class:`RFE` selectors are evaluated using `scoring` for
different number of selected features and aggregated together. Finally, the scores
The performance of each :class:`RFE` selector is evaluated using `scoring` for
different numbers of selected features and aggregated together. Finally, the scores
That's perfect! :)
NOTE that when using a custom scorer, it should return a single
value.
This constraint could be integrated with the description of callable a few lines above:
"callable with [signature ...], that returns a single value"
sklearn/linear_model/_logistic.py
Outdated
a scorer callable object / function with signature
``scorer(estimator, X, y)``. For a list of scoring functions
that can be used, look at :mod:`sklearn.metrics`.
scoring : str or callable or None
scoring : str or callable or None
scoring : str or callable, default=None
Should we also mention the default type here, similar as in the other parts of the docs?
There is no default for this one (private func), but I have looked at the rest and made them consistent, thanks!
Thanks for your further improvements, @lucyleeow!
I went over the PR again and found some minor things that could be improved. Otherwise I am very happy. I had been thinking for some time about how we could document the scoring better, because it was a bit obscure. But now it looks very helpful.
Thanks for the PR and sorry for taking so long to review @lucyleeow! Here are a couple of tweaks but the general status is already a big improvement :)
the test set. If `None`, the
:ref:`default evaluation criterion <scoring_api_overview>` of the estimator
is used.
the test set.
The term "test set" here makes me feel as if the user had to provided the test set coming from a train-test split. I guess "validation set" or just "left out data" would be more appropriate, but maybe it's only me. Alternatively, we could use a wording similar to
Strategy to evaluate the performance of the model across cross-validation splits.
If you agree, this comment also applies to `cross_val_score` and maybe slightly adapt it for `permutation_test_score`.
I like:
Strategy to evaluate the performance of the model across cross-validation splits.
I've altered it to use `estimator`, to reference the `estimator` parameter:
Strategy to evaluate the performance of the `estimator` across cross-validation splits.
Am happy to change though.
For `permutation_test_score` I have used "validation set", as we use CV to generate the null hypothesis distribution as well, so I thought "across cross-validation splits" was not as fitting. Happy to change.
Thanks for the review @ArturoAmorQ! I've made the changes but happy to change anything.
Thanks @lucyleeow! Merging!
Reference Issues/PRs
Follow on from #30303
Builds (essentially branches) from #30316 so keeping this as draft until that one goes in.
What does this implement/fix? Explain your changes.
Lists the `scoring` options as bullet points and states what `None` does.
Any other comments?