Implement temperature scaling for (multi-class) calibration #28574
Comments
@glemaitre WDYT? |
This sounds like a simple yet strong and very popular baseline. Note: the |
It appears that the temperature scaling neural network is defined by the function $\hat{q}_i = \operatorname{softmax}(z_i / T)$, where $z_i$ is the logit vector of sample $i$ and $T$ is the learned temperature. So:
Did I understand it correctly? |
Almost. It would be
|
In my opinion, post-hoc calibration of multiclass classification is an unsolved problem. A lot of heuristic arguments are used and it is difficult to choose one method over others. Classifier calibration: a survey on how to assess and improve predicted class probabilities gives a good overview. |
Could we make it solvable by constraining ourselves to the "matrix scaling" case (and hence the "temperature" and "vector" cases)? It also seems like we would have fewer arguments to worry about. However, there might be a misunderstanding on my part as well. |
Post-hoc calibration is essentially about learning a classifier whose inputs are the model outputs that should be calibrated. In this sense, post-hoc calibration of multiclass classification is an unsolved problem in the same sense that classification itself is an unsolved problem, because there is no free lunch. Still, scikit-learn provides a lot of classifiers, so why not also provide the "logistic regression" of post-hoc calibration methods? |
Not the whole story. Post-hoc calibration estimates |
Thank you for the replies; they are really helpful! I see there are two points of interest:
(Let me know if I missed any.) The former feels more like a hypothesis testing problem than a software development problem. For the latter, I can mirror the Would it be acceptable? |
Sure, the 1-dimensional problem is "easy" and the general multi-class case is not efficiently solvable in general. But why does that justify not offering a method that is often much better than doing nothing? |
|
Note that |
Hello, I've implemented a simple version of temperature scaling. The function takes the following arguments:
I've provided my proof of work below:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import softmax


def _row_max_normalization(data: np.ndarray) -> np.ndarray:
    '''Normalise the input by subtracting the per-row maximum element.'''
    row_max: np.ndarray = np.max(data, axis=1, keepdims=True)
    return data - row_max


def _softmax_T(predictions: np.ndarray, temperature: float) -> np.ndarray:
    '''Softmax of the row-max-normalized predictions scaled by the inverse temperature.'''
    softmax_T_output: np.ndarray = _row_max_normalization(predictions)
    softmax_T_output /= temperature
    softmax_T_output = softmax(softmax_T_output, axis=1)
    return softmax_T_output.astype(dtype=predictions.dtype)


def _exp_T(predictions: np.ndarray, temperature: float) -> np.ndarray:
    '''Scale the row-max-normalized predictions by the inverse temperature,
    then apply the natural exponential function.'''
    exp_T_output: np.ndarray = _row_max_normalization(predictions)
    exp_T_output /= temperature
    return np.exp(exp_T_output)


def _temperature_scaling(predictions: np.ndarray,
                         labels: np.ndarray,
                         initial_temperature: float) -> float:

    def negative_log_likelihood(temperature: float):
        '''Negative log-likelihood loss and its derivative with respect to
        the temperature.'''
        # Log-probability of the correct class for each sample
        losses: np.ndarray = _softmax_T(predictions, temperature)
        losses = losses[np.arange(losses.shape[0]), labels]
        losses = np.log(losses)

        # Derivative of each log-probability with respect to the temperature
        exp_T: np.ndarray = _exp_T(predictions, temperature)
        exp_T_sum = exp_T.sum(axis=1)

        term_1: np.ndarray = _row_max_normalization(predictions)
        term_1 /= temperature ** 2
        term_1 = -term_1[np.arange(term_1.shape[0]), labels]
        term_1 *= exp_T_sum

        term_2: np.ndarray = _row_max_normalization(predictions)
        term_2 /= temperature ** 2
        term_2 *= exp_T
        term_2 = term_2.sum(axis=1)

        dL_dts: np.ndarray = (term_1 + term_2) / exp_T_sum
        return -losses.sum(), -dL_dts.sum()

    temperature_minimizer = minimize(negative_log_likelihood,
                                     initial_temperature,
                                     method="L-BFGS-B",
                                     jac=True,
                                     options={"gtol": 1e-6,
                                              "ftol": 64 * np.finfo(float).eps})
    return temperature_minimizer.x[0]
```

I tested the function with the iris dataset, employing both the support vector classifier and the logistic regressor. The initial temperature of 1.5 converged to 0.15 and 0.12, respectively:

```python
>>> import numpy as np
>>> from sklearn import datasets
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.svm import SVC
>>> from sklearn.linear_model import LogisticRegression
>>> X, y = datasets.load_iris(return_X_y=True)
>>> X_train, X_test, y_train, y_test = train_test_split(X, y)
>>> SVC_classifier = SVC(probability=True)
>>> SVC_classifier.fit(X_train, y_train)
>>> svc_predictions = SVC_classifier.predict_proba(X_train)
>>> Logistic_Regression = LogisticRegression()
>>> Logistic_Regression.fit(X_train, y_train)
>>> logistic_predictions = Logistic_Regression.predict_proba(X_train)
>>> _temperature_scaling(svc_predictions, y_train, 1.5)
0.1491147513643915
>>> _temperature_scaling(logistic_predictions, y_train, 1.5)
0.1197697499802383
```

Any comments are welcome! Hope this can bring something new to the table. |
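For reference, the objective and gradient computed in negative_log_likelihood above can be written out as follows (a sketch, assuming $z$ denotes the row-max-normalized logit matrix and $y_i$ the correct class of sample $i$):

$$
\mathrm{NLL}(T) = -\sum_i \left( \frac{z_{i,y_i}}{T} - \log \sum_k \exp\!\frac{z_{i,k}}{T} \right),
\qquad
\frac{\partial\,\mathrm{NLL}}{\partial T}
= \sum_i \left( \frac{z_{i,y_i}}{T^2}
- \frac{\sum_k z_{i,k} \exp(z_{i,k}/T)}{T^2 \sum_k \exp(z_{i,k}/T)} \right).
$$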
Thank you! I don't have time to check the code in detail right now, but in the testing, you apply temperature scaling to probabilities instead of logits (e.g. log-probabilities). |
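To illustrate that point, here is a minimal sketch (assuming the _temperature_scaling helper from the comment above) of converting predict_proba outputs to shift-invariant logits before fitting the temperature; the name _probabilities_to_logits is illustrative, not an existing scikit-learn function:

```python
import numpy as np

def _probabilities_to_logits(probabilities: np.ndarray, eps: float = 1e-10) -> np.ndarray:
    '''Turn predicted probabilities into logits. Logits are only defined up to a
    per-row constant, which does not affect temperature scaling, so log(p + eps)
    is sufficient.'''
    return np.log(probabilities + eps)

# Usage sketch: fit the temperature on logits instead of probabilities.
# svc_logits = _probabilities_to_logits(svc_predictions)
# optimal_temperature = _temperature_scaling(svc_logits, y_train, 1.5)
```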
Thank you for the feedback! I just realized the term "logistic vector," and I think I can fix that. Regarding the optimal temperature being |
I don't know if there is a specific reference for the optimal temperature being >= 1; this is my intuition, for the following reason: the original temperature scaling paper shows that it is okay to stop training late (based on accuracy), when the cross-entropy is already starting to overfit / become overconfident, and to correct this with temperature scaling. Temperatures > 1 make the predictions less confident, so it would make sense. I also realized that it might be good to use some constrained optimization to prevent reaching temperatures <= 0, but I don't know what the best way to do that would be. |
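On the constrained-optimization question, one option (a sketch, not a final API) is to pass a lower bound to L-BFGS-B, or to optimize log T so the temperature stays positive by construction; negative_log_likelihood refers to the objective from the earlier comment:

```python
import numpy as np
from scipy.optimize import minimize

# Option 1: box constraint on T (L-BFGS-B supports bounds).
# result = minimize(negative_log_likelihood, x0=1.5, jac=True,
#                   method="L-BFGS-B", bounds=[(1e-3, None)])

# Option 2: reparametrize with T = exp(theta); any real theta gives T > 0.
def nll_in_log_space(theta, nll):
    temperature = np.exp(theta)
    value, grad = nll(temperature)
    # Chain rule: d/dtheta f(exp(theta)) = f'(T) * T
    return value, grad * temperature
```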
I've made updates to my code in my fork of scikit-learn, specifically regarding the "logistic vector" and "optimal temperature" points above. However, I couldn't test the code because of the following error:

```python
>>> CalibratedClassifierCV(base_classifier, cv=3, method='temperature')
InvalidParameterError: The 'method' parameter of CalibratedClassifierCV must be a string among {'isotonic', 'sigmoid'}. Got 'temperature' instead.
```

Is there a way to temporarily suppress parameter validation for further testing? |
@scikit-learn/core-devs ping for a decision. I still stand by my comment #28574 (comment), therefore -1 (until someone can show a clear improvement). |
re: #28574 (comment) @lorentzenchr I think one can pull a citation about things being not useful for pretty much anything. On the other hand, there seem to be enough people who care about this, and the literature seems decent. Also, it seems an easy enough method to implement that the maintenance burden shouldn't be high. So I think we should include this. |
@adrinjalali I don’t follow. The one reference I cited just gives a good & recent overview of the topic; in fact, it advertises post-hoc calibration. My point is that (post-hoc) calibration for multiclass is really an unsolved problem because there is no order for a vector of dim 3 or higher, i.e. the vector of probabilities |
Temperature scaling is a very standard method. It seems far-fetched not to accept it, IMHO
|
@virchan you'd need to edit the source code to remove the validations temporarily. @lorentzenchr it seems empirically this brings value on non-NN algorithms as well, if I read this thread correctly. So I'm not sure why you think this doesn't bring value. This is what I think would be nice to have, to move this discussion forward:
|
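Regarding the parameter-validation question above, instead of deleting the checks one can widen the constraint for 'method' locally; this is a sketch against scikit-learn's private validation machinery (the _parameter_constraints dict and StrOptions), so it is for temporary testing only and may need adjusting:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.utils._param_validation import StrOptions

# Allow the experimental 'temperature' option to pass validation (private API).
CalibratedClassifierCV._parameter_constraints["method"] = [
    StrOptions({"sigmoid", "isotonic", "temperature"})
]
```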
Hello all, I have opened a PR #29517 implementing temperature scaling for multi-class classification within the |
Could you please point me to it, because I have not found such evidence. |
I'm really open to being convinced, but I need evidence for non-neural-net models.
Until then, I am -1 on this.
It's really a classic method, super heavily used. Why block it?
|
My coauthors and I created a benchmark, which is soon to be published at NeurIPS 2024 and available open-source at https://github.com/dholzmueller/pytabkit and https://arxiv.org/abs/2407.04491 |
@dholzmueller is it correct that you used classification error? Could you show/produce results for log loss? |
I used classification error in the paper, but logloss and Brier score to assess temperature scaling in the results I was referring to. I can try to dig them up later if you are interested. |
Alright, some numbers on the multi-class datasets of our meta-train benchmark (XGB-TD is XGBoost with our meta-learned default parameters):
Geometric mean (log-loss + 0.01):
|
Seems like according to our governance, we need to call for a vote on this one to resolve the issue. |
Please wait. |
Main Topic
@dholzmueller Thank you so much. This is the first time I have seen the effect of temperature scaling for non-NN models (for classification). Your numbers seem to clearly indicate an improvement. Therefore, as announced, I change my decision to +1 for inclusion.
Implementation
The implementation should use the parametrization with multiplication instead of division, as proposed in #28574 (comment).
Meta Discussion
I want to stress the following points about the discussion that has happened:
|
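To make the parametrization point concrete, here is a minimal sketch of the multiplicative form, learning an inverse temperature beta = 1/T instead of dividing by T (the function name is illustrative, not the proposed scikit-learn API):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import log_softmax

def fit_inverse_temperature(logits: np.ndarray, labels: np.ndarray, beta0: float = 1.0) -> float:
    '''Fit beta in softmax(beta * logits) by minimizing the negative log-likelihood.'''
    def nll(beta):
        log_probs = log_softmax(beta * logits, axis=1)
        return -log_probs[np.arange(len(labels)), labels].sum()
    return minimize(nll, x0=beta0, method="L-BFGS-B").x[0]
```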
Describe the workflow you want to enable
It would be great to have temperature scaling available as a post-hoc calibration method for binary and multi-class classifiers, for example in CalibratedClassifierCV.
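A sketch of the workflow this would enable; note that method="temperature" is the option proposed in this issue, not an existing scikit-learn parameter:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.svm import LinearSVC

# Hypothetical usage once temperature scaling is available:
calibrated = CalibratedClassifierCV(LinearSVC(), method="temperature", cv=5)
# calibrated.fit(X_train, y_train)
# probabilities = calibrated.predict_proba(X_test)
```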
Describe your proposed solution
Temperature scaling is a simple, efficient, and very popular post-hoc calibration method that also naturally supports the multi-class classification setting. It has been proposed in Guo et al. (2017) with >5000 citations, so it meets the inclusion criterion: http://proceedings.mlr.press/v70/guo17a.html
It also does not affect rank-based metrics (if the temperature is restricted to positive values) unlike isotonic regression (#16321). Moreover, it avoids the infinite-log-loss problems of isotonic regression.
Temperature scaling has been discussed in #21785
I experimented with different post-hoc calibration methods on 71 medium-sized (2K-50K samples) tabular classification data sets. For NNs and XGBoost, temperature scaling is competitive with isotonic regression and considerably better than Platt scaling (if Platt scaling is applied to probabilities, as implemented in scikit-learn, and not logits). For AUC, it is considerably better than isotonic regression.
Here is a simple implementation using PyTorch (it can be adapted to numpy). It is derived from the popular but no longer maintained implementation at https://github.com/gpleiss/temperature_scaling/blob/master/temperature_scaling.py with the following changes: it works on probabilities as returned by predict_proba(). The code converts probabilities to logits using log(probs + 1e-10). While the logits are only determined up to a constant shift, the choice of the constant does not affect the result of temperature scaling.
Describe alternatives you've considered, if relevant
Centered isotonic regression (#21454) is less popular and does not fully solve the problem of affecting rank-based metrics.
Beta-calibration (#25552) seems very similar or even partially identical but is less well-cited, and only formulated for binary classification.
Additional context
No response