
ENH Add Multiclass Brier Score Loss #22046

Open
wants to merge 44 commits into base: main

Conversation

ogrisel
Member

@ogrisel ogrisel commented Dec 21, 2021

Resolves #16055.
This PR updates #18699 by @aggvarun01 after a merge with main, resolving the merge conflicts. I do not have permission to push directly to the original branch, and opening a sub-PR pointing to #18699 would lead to an unreadable diff because of the year of changes merged from main.

I also added a changelog entry and demonstrated the new function in the multiclass calibration example.

@aggvarun01 if you want, feel free to pull the last commits from this branch into your branch. Alternatively, we can finalize the review here.

@lorentzenchr
Member

@ogrisel Do you intend to finish this one? It would be nice to have it prior to #28971.

@ogrisel
Member Author

ogrisel commented May 21, 2024

> Two strings in the scorer string "binary_brier_score", "multiclass_brier_score"

I guess the scorer strings would instead be "neg_binary_brier_score" and "neg_multiclass_brier_score" (to enforce the greater-is-better model selection convention common to all our scorer string names), while we would also introduce two functions, binary_brier_score and multiclass_brier_score, that would probably share some common private helper code under the hood. We would ensure that the docstrings of those functions cross-reference each other and explain the difference in normalization constant.

The binary_brier_score would raise if passed a y_observed value (or y_proba_pred) with n_classes > 2, with a message pointing to the multiclass_ variant instead.

The multiclass_ variants on the other hand could accept binary data without raising or warning I think.

Then we would deprecate the "neg_brier_score" name and the brier_score function in favor of the binary_-prefixed counterparts.

If that's right I can start working on reviving this PR to implement this plan.
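For concreteness, the two-function split sketched above could look roughly like this (a minimal illustration of the proposal, not the PR's actual code; the signatures are simplified and the helper name `_brier_components` is made up):

```python
import numpy as np


def _brier_components(y_true, y_proba, labels):
    # Hypothetical shared helper: one-hot encode y_true against `labels`
    # and return the per-sample, per-class squared differences.
    labels = np.asarray(labels)
    y_onehot = (np.asarray(y_true)[:, None] == labels[None, :]).astype(float)
    return (y_onehot - np.asarray(y_proba, dtype=float)) ** 2


def multiclass_brier_score(y_true, y_proba, labels):
    # Sum over classes, average over samples -> range [0, 2].
    return _brier_components(y_true, y_proba, labels).sum(axis=1).mean()


def binary_brier_score(y_true, y_proba, labels):
    # The binary variant raises on more than two classes, pointing the
    # user to the multiclass_ variant instead.
    labels = np.asarray(labels)
    if labels.shape[0] > 2:
        raise ValueError(
            "binary_brier_score only supports two classes; "
            "use the multiclass_ variant instead."
        )
    # Binary convention: half the multiclass sum -> range [0, 1].
    return multiclass_brier_score(y_true, y_proba, labels) / 2
```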

@lorentzenchr
Member

The first step, if I recall correctly, would be to add a keyword to brier_score_loss to switch between normalizations.
A second step is about the different strings to select a score. I'm still not too happy about a proliferation of names, because a user can always use the score directly with the normalization parameter.


github-actions bot commented Nov 6, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 4f00c63.

@antoinebaker
Contributor

antoinebaker commented Nov 6, 2024

For the changelog, I first tried two entries 22046.feature.rst and 18699.feature.rst with the same content. That indeed works for towncrier, it merges them into one single entry and credits both PR #22046 and #18699. But then unfortunately the "Check Changelog" complains, because I guess #18699 has not been merged, so I only keep this PR entry 22046.feature.rst.

Comment on lines 3546 to 3549


def multiclass_brier_score_loss(y_true, y_prob, sample_weight=None, labels=None):
    r"""Compute the Brier score loss.
Contributor

@antoinebaker antoinebaker Nov 6, 2024


@ogrisel Should we add a @validate_params decorator here? It seems that all metrics have this decorator now. Also, should we rename y_prob -> y_proba? It seems to be the new convention (e.g. in brier_score_loss).

@antoinebaker
Contributor

Should we add array API support for multiclass_brier_score_loss and brier_score_loss, or do this in a follow-up PR?

@lorentzenchr
Member

Follow-up PR. Let's keep different features well separated by PRs.

@antoinebaker
Contributor

antoinebaker commented Nov 12, 2024

@ogrisel @lorentzenchr have you settled the scorer strings / normalization keyword debate? From the comments above, it's not clear to me whether you prefer:

  1. deprecate brier_score_loss in favor of binary_brier_score_loss and corresponding scoring strings
  2. introduce a normalization keyword to brier_score_loss

@lorentzenchr
Member

I am -1000 for 2 different functions for binary and multiclass. I want a single function dealing with both, as does log_loss.

@ogrisel
Member Author

ogrisel commented Nov 14, 2024

Let's keep y_prob, it's still used a lot elsewhere and y_proba sounds like a weird Frenchism.

@antoinebaker
Contributor

antoinebaker commented Dec 4, 2024

The normalize keyword was added to brier_score_loss with the meaning "divides by the number of classes".

Then b60cb3b results in the following for brier_score_loss:

  • the unnormalized Brier score ranges from 0 to 2
  • the normalized Brier score ranges from 0 to 2/n_classes
  • labels was added for the multi-class support (required when not all classes appear in y_true)
  • pos_label was kept for binary classification (required when y_true contains strings)

There is some ambiguity on binary vs multiclass when n_classes=2, especially about the usage of pos_label vs labels. Currently the following choice has been made:

  • targets y_true must be passed as a 1D array (n_samples,)
  • pos_label is only used when y_prob.shape=(n_samples,), which is treated as a binary classification task
  • labels is only used when y_prob.shape=(n_samples, n_classes), which is treated as a multiclass task (even when n_classes=2)
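The shape-based dispatch described in the bullets above can be illustrated with a small helper (a sketch of the rule only, not the PR's code; `infer_task` is a hypothetical name):

```python
import numpy as np


def infer_task(y_prob):
    # Dispatch rule described above: 1D probabilities of shape (n_samples,)
    # are treated as a binary task (pos_label applies); 2D probabilities of
    # shape (n_samples, n_classes) are treated as a multiclass task
    # (labels applies), even when n_classes == 2.
    y_prob = np.asarray(y_prob)
    if y_prob.ndim == 1:
        return "binary"
    if y_prob.ndim == 2:
        return "multiclass"
    raise ValueError("y_prob must be 1D (n_samples,) or 2D (n_samples, n_classes)")
```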

@lorentzenchr @ogrisel are you happy with the current API?
If so, I can continue on cleaning up the tests, docs and scoring strings.

@ogrisel
Member Author

ogrisel commented Dec 6, 2024

I think the API makes sense, apart from the meaning of normalize=True.

However, I am not sure about the normalize=True parameter with 2/n_classes. For three class problems, that would mean Brier score values between 0 and 2/3? Who would want to compute this?

Shall we not just error when normalize=True and n_classes > 2? Then we issue a FutureWarning to state that in the future normalize=False will become the default, meaning that Brier score will always be between 0 and 2, including for binary classification problems.

Alternatively, normalize=True could mean to divide by 2 instead of by n_classes so that normalize=True means that Brier score is always scaled between 0 (best) and 1 (worst), whatever the number of classes.

This parameter could be named scaled_by_half=True instead or something else more explicit than normalize.

@antoinebaker
Contributor

> I think the API makes sense, apart from the meaning of normalize=True.
>
> However, I am not sure about the normalize=True parameter with 2/n_classes. For three class problems, that would mean Brier score values between 0 and 2/3? Who would want to compute this?

I guess only for the correspondence with the MSE: if we divide by n_classes, the Brier score is indeed the MSE. But it doesn't seem to be common practice, and I prefer your scaled_by_half parameter.

What about scaled_by_half with a default "auto" option, meaning scaled_by_half=(n_classes == 2)?
The Brier score then follows the convention by default and remains backward compatible.
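For concreteness, the "auto" behavior proposed above could look like this (a hypothetical sketch; the name `brier_score` and its signature are illustrative only, not the PR's actual implementation):

```python
import numpy as np


def brier_score(y_onehot, y_proba, scale_by_half="auto"):
    # Per-sample sum of squared differences between one-hot targets and
    # predicted probabilities, averaged over samples -> range [0, 2].
    y_onehot = np.asarray(y_onehot, dtype=float)
    y_proba = np.asarray(y_proba, dtype=float)
    score = ((y_onehot - y_proba) ** 2).sum(axis=1).mean()
    if scale_by_half == "auto":
        # Backward-compatible default: halve only in the binary case,
        # so binary scores stay in [0, 1] and multiclass stays in [0, 2].
        scale_by_half = y_onehot.shape[1] == 2
    return score / 2 if scale_by_half else score
```

Dividing the unscaled score by n_classes instead of by 2 would recover the MSE correspondence mentioned above.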

@ogrisel
Member Author

ogrisel commented Dec 9, 2024

+1 for scaled_by_half="auto" that divides by 2 whenever n_classes == 2.

This way we can automatically switch between the two formulas of the Wikipedia page:

https://en.wikipedia.org/wiki/Brier_score

It's also backward compatible with the current implementation, while also giving control to the user if they prefer to use a different convention.

What do you think of this proposal @lorentzenchr?

@lorentzenchr
Member

In my opinion, it's important to keep the goal in sight: one Brier scoring function, ideally with no scaling (always range [0, 2]).

  • So yes, I'm ok with scaled_by_half.
  • I would rename it to scale_by_half.
  • I would start with the default scale_by_half="deprecated" and later set the default to True.
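The deprecation cycle suggested in the last bullet might look like this in code (a sketch only, assuming the eventual semantics discussed above; the warning wording and the binary-only body are illustrative):

```python
import warnings

import numpy as np


def brier_score_loss(y_true, y_proba, scale_by_half="deprecated"):
    # Hypothetical deprecation path: warn while the default is the
    # "deprecated" sentinel, then flip the default in a later release.
    if scale_by_half == "deprecated":
        warnings.warn(
            "The default of scale_by_half will change in a future version; "
            "set it explicitly to silence this warning.",
            FutureWarning,
        )
        scale_by_half = True  # current backward-compatible binary behavior

    # Binary Brier score in the [0, 1] convention; doubled for [0, 2].
    score = np.mean((np.asarray(y_true) - np.asarray(y_proba)) ** 2)
    return score if scale_by_half else 2 * score
```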

@ogrisel
Member Author

ogrisel commented Dec 9, 2024

> I would rename it to scale_by_half.

+1

> I would start with the default scale_by_half="deprecated" and later set the default to True.

To me, scale_by_half=True means a BS range of [0, 1] (instead of [0, 2]) and would typically be used for the binary classification setting where it's common to use the [0, 1] range.

I am ok with raising a FutureWarning to change the scale_by_half default value to False in the future (to always use the [0, 2] range): in the long term this will make our code simpler to maintain by only implementing the convention introduced in the original definition by Brier.

Development

Successfully merging this pull request may close these issues.

Add multiclass support for brier_score_loss
6 participants