Implement temperature scaling for (multi-class) calibration #28574

Open
dholzmueller opened this issue Mar 4, 2024 · 32 comments
Labels
help wanted · Moderate (anything that requires some knowledge of conventions and best practices) · module:calibration · New Feature

Comments

@dholzmueller

Describe the workflow you want to enable

It would be great to have temperature scaling available as a post-hoc calibration method for binary and multi-class classifiers, for example in CalibratedClassifierCV.

Describe your proposed solution

Temperature scaling is a simple, efficient, and very popular post-hoc calibration method that also naturally supports the multi-class classification setting. It was proposed by Guo et al. (2017), which has >5000 citations, so it meets the inclusion criterion: http://proceedings.mlr.press/v70/guo17a.html
Unlike isotonic regression (#16321), it does not affect rank-based metrics (as long as the temperature is restricted to positive values). Moreover, it avoids the infinite-log-loss problems of isotonic regression.
Temperature scaling has been discussed in #21785.
I experimented with different post-hoc calibration methods on 71 medium-sized (2K-50K samples) tabular classification datasets. For NNs and XGBoost, temperature scaling is competitive with isotonic regression and considerably better than Platt scaling (when Platt scaling is applied to probabilities, as implemented in scikit-learn, rather than to logits). On AUC, it is considerably better than isotonic regression.

Here is a simple implementation using PyTorch (can be adapted to numpy). It is derived from the popular but no longer maintained implementation at https://github.com/gpleiss/temperature_scaling/blob/master/temperature_scaling.py
with the following changes:

  • using inverse temperatures to prevent division by zero errors
  • using 50 optimizer steps instead of a single one (the single step appears to be an error in the mentioned repo; the original paper mentions that 10 CG iterations should be enough, whereas here it is 50 L-BFGS iterations)
  • accepting probabilities as provided by many scikit-learn estimators using predict_proba(). The code converts probabilities to logits using log(probs + 1e-10). While the logits are only determined up to a constant shift, the choice of the constant does not affect the result of temperature scaling.
import torch
import torch.nn as nn
import numpy as np
from sklearn.base import BaseEstimator

class InverseTemperatureScalingCalibrator(BaseEstimator):
    # following https://github.com/gpleiss/temperature_scaling/blob/master/temperature_scaling.py
    def _get_logits(self, X):
        X = X + 1e-10
        X /= np.sum(X, axis=-1, keepdims=True)
        return torch.as_tensor(np.log(X), dtype=torch.float32)

    def fit(self, X, y):
        # X should be the probabilities as output by predict_proba()
        logits = self._get_logits(X)
        labels = torch.as_tensor(y)
        # start at T = 1.5 (i.e. beta = 1/1.5), as in the reference implementation
        self.inv_temperature_ = nn.Parameter(torch.ones(1) / 1.5)
        criterion = nn.CrossEntropyLoss()

        optimizer = torch.optim.LBFGS([self.inv_temperature_], lr=0.01, max_iter=50)

        def closure():
            # re-evaluate the loss and its gradient, as required by torch.optim.LBFGS
            optimizer.zero_grad()
            y_pred = logits * self.inv_temperature_[:, None]
            loss = criterion(y_pred, labels)
            loss.backward()
            return loss

        for i in range(50):
            optimizer.step(closure)

        print(f'Optimal temperature: {(1./self.inv_temperature_).item():g}')
        return self

    def predict_proba(self, X):
        # X should be the probabilities as output by predict_proba()
        logits = self._get_logits(X)
        with torch.no_grad():
            y_pred = logits * self.inv_temperature_[:, None]
            return torch.softmax(y_pred, dim=-1).detach().numpy()
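
For illustration, here is a hypothetical usage sketch (the dataset, classifier, and calibration split are placeholders, not part of any proposal): the temperature is fit on probabilities predicted for data that the classifier did not see during training.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_classes=3, n_informative=6, random_state=0)
X_train, X_cal, y_train, y_cal = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
calibrator = InverseTemperatureScalingCalibrator()
calibrator.fit(clf.predict_proba(X_cal), y_cal)   # fit the temperature on held-out probabilities
calibrated = calibrator.predict_proba(clf.predict_proba(X_cal))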

Describe alternatives you've considered, if relevant

Centered isotonic regression (#21454) is less popular and does not fully solve the problem of affecting rank-based metrics.
Beta calibration (#25552) seems very similar, or even partially identical, but it is less well-cited and only formulated for binary classification.

Additional context

No response

@dholzmueller added the Needs Triage and New Feature labels on Mar 4, 2024
@adrinjalali
Member

@glemaitre WDYT?

@ogrisel
Member

ogrisel commented Mar 4, 2024

This sounds like a simple yet strong and very popular baseline.

Note: the _get_logits trick to invert a sigmoid / softmax is a bit of a hack, but scikit-learn does not provide a generic way to access raw logits in general. We could leverage predict_log_proba when it exists, but I am not 100% sure it's always equivalent. For some estimators, such as random forests, we do provide predict_log_proba, but it just calls np.log on predict_proba, which is likely to be less stable than the _get_logits trick suggested here.
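
For instance, just to illustrate the stability concern with hard 0/1 probabilities (as can happen, e.g., with a small forest):

import numpy as np

probs = np.array([[1.0, 0.0, 0.0]])  # hard 0/1 probabilities
np.log(probs)                        # -> [[0., -inf, -inf]], which breaks downstream optimization
np.log(probs + 1e-10)                # -> finite values that can serve as approximate logits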

@adrinjalali removed the Needs Triage label on Mar 5, 2024
@virchan
Contributor

virchan commented Mar 10, 2024

It appears that the temperature scaling neural network is defined by the function:

$$\left[ y_1, \cdots, y_n \right] \mapsto \mathrm{softmax}\left( \left[ \frac{y_1}{T}, \cdots, \frac{y_n}{T} \right] \right),$$

where $\displaystyle \sum_{i=1}^n y_i = 1$; and $T$ is the temperature parameter.

So:

  1. "Training the model" means "optimising the loss with respect to $T$".
  2. The trained model needs to remember $T$, and possibly the input dimension $n$.

Did I understand it correctly?

@dholzmueller
Author

Almost. It would be
$\left[ y_1, \cdots, y_n \right] \mapsto \mathrm{softmax}\left( \left[ \frac{\log(y_1)}{T}, \cdots, \frac{\log(y_n)}{T} \right] \right)$
if the $y_i$ are probabilities (actually, they don't need to be normalized for this to work).

  1. Indeed, and one could support different loss functions like log-loss and Brier loss. What I did above is to optimize $\beta = 1/T$ instead, so that a division by zero cannot occur (see the small sketch below).
  2. It needs to remember $T$, but not $n$.
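
For concreteness, a small numpy sketch of this mapping with $\beta = 1/T$ (temperature_scale is just an illustrative helper name):

import numpy as np
from scipy.special import softmax

def temperature_scale(probs, beta):
    # probs: (n_samples, n_classes) array as returned by predict_proba(); beta = 1 / T
    logits = np.log(probs + 1e-10)          # logits, determined only up to a per-row shift
    return softmax(beta * logits, axis=-1)  # the per-row shift cancels inside the softmax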

@lorentzenchr
Member

In my opinion, post-hoc calibration of multiclass classification is an unsolved problem. A lot of heuristic arguments are used, and it is difficult to choose one method over another. The survey "Classifier calibration: a survey on how to assess and improve predicted class probabilities" gives a good overview.

@virchan
Contributor

virchan commented Mar 12, 2024

Could we make it solvable by restricting ourselves to the "matrix scaling" case (and hence the "temperature" and "vector" cases)? It also seems like we would have fewer arguments to worry about. There might, of course, be a misunderstanding on my part.

@dholzmueller
Author

Post-hoc calibration is essentially about learning a classifier whose inputs are the model outputs that should be calibrated. In this sense, post-hoc calibration of multiclass classification is an unsolved problem in the same sense that classification itself is an unsolved problem, because there is no free lunch. Still, scikit-learn provides a lot of classifiers, so why not also provide the "logistic regression" of post-hoc calibration methods?

@lorentzenchr
Member

In this sense, post-hoc calibration of multiclass classification is an unsolved problem in the same sense that classification itself is an unsolved problem

Not the whole story. Post-hoc calibration estimates $P(Y|m(X))$ with model $m$, while the original goal is to estimate $P(Y|X)$. Note the difference in conditioning! For binary classification, $m(X)$ is essentially 1-dimensional and therefore $P(Y|m(X))$ is amenable to estimation.

@virchan
Contributor

virchan commented Mar 13, 2024

Thank you for the replies; they are really helpful!

I see there are two points of interest:

  1. Decide if the calibration methods improve the original predictions in a generic data project.
  2. Include (at least some) post-hoc calibration methods in scikit-learn's framework.

(Let me know if I missed any.)

The former feels more like a hypothesis testing problem than a software development problem.

For the latter, I can mirror the MLPClassifier module to create the "matrix scaling" calibration method, and I don't mind working on it at all.

Would it be acceptable?

@dholzmueller
Author

Not the whole story. Post-hoc calibration estimates P(Y|m(X)) with model m, while the original goal is to estimate P(Y|X). Note the difference in conditioning! For binary classification, m(X) is essentially 1-dimensional and therefore P(Y|m(X)) is amenable to estimation.

Sure, the 1-dimensional problem is "easy" and the general multi-class case is not efficiently solvable in general. But why does that justify not offering a method that is often much better than doing nothing?

@dholzmueller
Author

@virchan

  1. This has been evaluated at least in the original paper I mentioned, and probably in many others as well.
  2. The "matrix scaling" method mostly performed much worse than temperature scaling in the original paper, so I think temperature scaling should have higher priority.

@lorentzenchr
Member

Sure, the 1-dimensional problem is "easy" and the general multi-class case is not efficiently solvable in general. But why does that justify not offering a method that is often much better than doing nothing?

Note that CalibratedClassifierCV has multiclass support via OvR, see https://scikit-learn.org/stable/modules/calibration.html#multiclass-support.
If another method shows a clear improvement (best demonstrated in highly cited literature), we are open, even happy, to include it.

@virchan
Contributor

virchan commented Apr 17, 2024

Hello,

I've implemented a simple version of temperature scaling with numpy and scipy using the _temperature_scaling function.

The function takes the following arguments:

  • predictions: (n_samples, n_classes) array, the output of the predict_proba method.
  • labels: (n_samples, ) array, the correct label in ordinal fashion.
  • initial_temperature: float, the initial temperature to start with (as seen in the OP, it's set to 1.5).

I've provided a proof of concept below:

import numpy as np
from scipy.optimize import minimize
from scipy.special import softmax

def _row_max_normalization(data: np.ndarray) -> np.ndarray:
    '''Normalise the output by subtracting
       the per-row maximum element.
    '''
    row_max: np.ndarray = np.max(data, 
                                 axis = 1, 
                                 keepdims = True
                                )
    
    return data - row_max

def _softmax_T(predictions: np.ndarray, 
               temperature: float,
              ) -> np.ndarray:
    '''Softmax function scaled by the
       inverse temperature.
    '''
    
    softmax_T_output: np.ndarray = predictions
    softmax_T_output = _row_max_normalization(softmax_T_output)  
    softmax_T_output /= temperature  
    softmax_T_output = softmax(softmax_T_output, 
                               axis = 1
                              )
    softmax_T_output = softmax_T_output.astype(dtype = predictions.dtype)
    
    return softmax_T_output

def _exp_T(predictions: np.ndarray, 
           temperature: float
          ) -> np.ndarray:
    '''Scale by the inverse temperature,
       and then apply the natural
       exponential function.
    '''
    
    exp_T_output: np.ndarray = predictions
    exp_T_output = _row_max_normalization(exp_T_output)
    exp_T_output /= temperature
    exp_T_output = np.exp(exp_T_output)
    
    return exp_T_output 

def _temperature_scaling(predictions: np.ndarray, 
                         labels: np.ndarray, 
                         initial_temperature: float
                        ) -> float:
    
    def negative_log_likelihood(temperature: float):
        '''Negative Log Likelihood Loss with respect
           to Temperature
        '''
        
        # Temperature-scaled probabilities
        losses: np.ndarray = _softmax_T(predictions, 
                                          temperature
                                         )
            
        # Select the probability of the correct class
        losses = losses[np.arange(losses.shape[0]), 
                        labels
                       ]
        
        losses = np.log(losses)
        
        # Derivatives with respect to the temperature
        exp_T: np.ndarray = _exp_T(predictions, temperature)
        exp_T_sum = exp_T.sum(axis = 1)
        
        term_1: np.ndarray = _row_max_normalization(predictions)
        term_1 /= temperature ** 2
        term_1 = - term_1[np.arange(term_1.shape[0]), 
                          labels
                         ]
        term_1 *= exp_T_sum

        term_2: np.ndarray = _row_max_normalization(predictions)
        term_2 /= temperature ** 2
        term_2 = _row_max_normalization(term_2)
        term_2 *= exp_T
        term_2 = term_2.sum(axis = 1)
        
        dL_dts: np.ndarray = (term_1 + term_2) / exp_T_sum
            
        # print(f"{-losses.sum() = },  {-dL_dts.sum() = }")
            
        return -losses.sum(),  -dL_dts.sum()
    
    temperature_minimizer = minimize(
        negative_log_likelihood,
        initial_temperature,
        method="L-BFGS-B",
        jac=True,
        options={"gtol": 1e-6, "ftol": 64 * np.finfo(float).eps},
    )
        
    return temperature_minimizer.x[0]

I tested the function with the iris dataset, employing both the support vector classifier and the logistic regressor. The initial temperature of 1.5 converged to 0.15 and 0.12, respectively:

>>> import numpy as np
>>> from sklearn import datasets
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.svm import SVC
>>> from sklearn.linear_model import LogisticRegression
>>> X, y = datasets.load_iris(return_X_y=True)
>>> X_train, X_test, y_train, y_test = train_test_split(X, y)
>>> SVC_classifier = SVC(probability=True)
>>> SVC_classifier.fit(X_train, y_train)
>>> svc_predictions = SVC_classifier.predict_proba(X_train)
>>> Logistic_Regression = LogisticRegression()
>>> Logistic_Regression.fit(X_train, y_train)
>>> logistic_predictions = Logistic_Regression.predict_proba(X_train)
>>> _temperature_scaling(svc_predictions, y_train, 1.5)
0.1491147513643915
>>> _temperature_scaling(logistic_predictions, y_train, 1.5)
0.1197697499802383

Any comments are welcome! Hope this can bring something new to the table.

@dholzmueller
Author

Thank you! I don't have time to check the code in detail right now, but in the testing, you apply temperature scaling to probabilities instead of logits (e.g. log-probabilities).
Also, regarding the testing, temperature scaling should usually be performed on a validation set and not on the training set. For methods using a cross-entropy loss, like logistic regression, the optimal temperature should then usually be >= 1 unless the validation accuracy is 100%. We could also test that the cross-entropy loss decreases after temperature scaling.
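
For example, a rough sketch of such a check (clf, X, and y are placeholders, and the calibrator is the one from the first post):

from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)
clf.fit(X_train, y_train)
probs_val = clf.predict_proba(X_val)

calibrator = InverseTemperatureScalingCalibrator().fit(probs_val, y_val)
print(log_loss(y_val, probs_val))                            # before temperature scaling
print(log_loss(y_val, calibrator.predict_proba(probs_val)))  # should not be larger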

@virchan
Contributor

virchan commented Apr 24, 2024

Thank you for the feedback!

I see now that the inputs should be the logit vector rather than the probabilities, and I think I can fix that.

Regarding the optimal temperature being $\geq 1$, is there a reference for that? I would like to learn more as well.

@dholzmueller
Author

I don't know if there is a specific reference for the optimal temperature being >= 1; it is my intuition for the following reason: the original temperature scaling paper shows that it is okay to stop training late (based on accuracy), when the cross-entropy loss is already starting to overfit / become overconfident, and to correct for that with temperature scaling. Temperatures > 1 make the predictions less confident, so it would make sense.

I also realized that it might be good to use some constrained optimization, to prevent reaching temperatures <= 0, but I don't know what would be the best way to do that.
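
One simple option (just a sketch with a hypothetical helper, not a worked-out proposal) would be to optimize log(T), so that the temperature stays positive by construction; bound constraints in L-BFGS-B would be another way:

import numpy as np
from scipy.optimize import minimize
from scipy.special import log_softmax

def fit_log_temperature(probs, labels, initial_temperature=1.5):
    logits = np.log(probs + 1e-10)

    def nll(log_t):
        # parametrize by log(T) so that T = exp(log_t) > 0 always holds
        log_p = log_softmax(logits / np.exp(log_t), axis=-1)
        return -log_p[np.arange(len(labels)), labels].mean()

    result = minimize(nll, x0=[np.log(initial_temperature)], method="L-BFGS-B")
    return float(np.exp(result.x[0]))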

@virchan
Contributor

virchan commented May 13, 2024

I've made changes to my code in my fork of scikit-learn. Specifically, the "logit vector" and "optimal temperature $\geq$ 1" parts are fixed. These changes can be found in the sklearn/calibration_temperature.py file.

However, I couldn't test the code because of the following error:

>>> CalibratedClassifierCV(base_classifier, cv=3, method='temperature')
InvalidParameterError: The 'method' parameter of CalibratedClassifierCV must be a string among {'isotonic', 'sigmoid'}. Got 'temperature' instead.

Is there a way to temporarily suppress parameter validation for further testing?

@lorentzenchr
Member

@scikit-learn/core-devs ping for a decision.

I still stand by my comment #28574 (comment), therefore -1 (until someone can show a clear improvement).

@adrinjalali
Member

re: #28574 (comment)

@lorentzenchr I think one can pull up citations claiming that something is not useful for pretty much anything. On the other hand, there seem to be enough people who care about this, and the literature seems decent. Also, it seems an easy enough method to implement that the maintenance burden shouldn't be high. So I think we should include this.

@lorentzenchr
Member

@adrinjalali I don’t follow. The one reference I cited just gives a good & recent overview of the topic; in fact, it advertises post-hoc calibration.

My point is that (post-hoc) calibration for multiclass is really an unsolved problem because there is no order for a vector of dimension 3 or higher, i.e. the vector of probabilities $P(Y) \in \mathbb{R}^c$.
Temperature scaling is just a heuristic mostly used for neural nets. Even for those models, the advantage is debatable. For logistic regression and tree-based models, which we focus on, I have not seen results, and I would be surprised to see (uncontroversial) positive results.

@GaelVaroquaux
Member

GaelVaroquaux commented May 15, 2024 via email

@adrinjalali
Member

Is there a way to temporarily suppress parameter validation for further testing?

@virchan you'd need to edit the source code to remove the validations temporarily.

@lorentzenchr it seems empirically this brings value on non-NN algorithms as well, if I read this thread correctly. So I'm not sure why you think this doesn't bring value.

This is what I think would be nice to have, to move this discussion forward:

  • A prototype PR to see the implementation
  • A few case studies on tree-based and other non-NN algorithms to show its benefits (which I think is already partly provided in this thread). Specifically, a counter-argument to this would make the discussion much easier:

Temperature scaling is just a heuristic mostly used for neural nets. Even for those models, the advantage is debatable. For logistic regression and tree-based models, which we focus on, I have not seen results, and I would be surprised to see (uncontroversial) positive results.

@virchan
Contributor

virchan commented Jul 18, 2024

Hello all,

I have opened a PR #29517 implementing temperature scaling for multi-class classification within the CalibratedClassifierCV class. You can find the details and code in the PR. Comments and feedback are more than welcome.

@lorentzenchr
Member

it seems empirically this brings value on non-NN algorithms as well, if I read this thread correctly.

Could you please point me to it, because I have not found such evidence.
I'm really open to being convinced, but I need evidence for non-neural-net models.
Until then, I am -1 on this.

@GaelVaroquaux
Member

GaelVaroquaux commented Oct 14, 2024 via email

@dholzmueller
Author

it seems empirically this brings value on non-NN algorithms as well, if I read this thread correctly.

Could you please point me to it, because I have not found such evidence. I'm really open to being convinced, but I need evidence for non-neural-net models. Until then, I am -1 on this.

My coauthors and I created a benchmark, which is soon to be published at NeurIPS 2024 and available open-source at https://github.com/dholzmueller/pytabkit and https://arxiv.org/abs/2407.04491
I tried temperature scaling on it (not in the paper, maybe in a future one) and my experience was that it worked really well in the multiclass setting, both for XGBoost and NNs. I don't know if there are papers experimenting with this. However, a point of evidence might be that it is used by default in AutoGluon for logloss: https://auto.gluon.ai/dev/api/autogluon.tabular.TabularPredictor.fit.html

@lorentzenchr
Member

@dholzmueller is it correct that you used classification error? Could you show/produce results for log loss?
Log loss (and also squared error, a.k.a. the Brier score) is a much better metric for classification because it is a proper scoring rule. The prediction error (1 - accuracy) can be tuned by selecting probability cutoffs (instead of temperature scaling).
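
For reference, a quick sketch of how both could be computed on held-out predictions (probs_val and y_val are placeholders; the multi-class Brier score is written out by hand):

import numpy as np
from sklearn.metrics import log_loss
from sklearn.preprocessing import label_binarize

def multiclass_brier_score(y_true, probs, classes):
    Y = label_binarize(y_true, classes=classes)   # one-hot labels, shape (n_samples, n_classes)
    return np.mean(np.sum((probs - Y) ** 2, axis=1))

# e.g. log_loss(y_val, probs_val) and multiclass_brier_score(y_val, probs_val, classes=np.arange(probs_val.shape[1]))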

@dholzmueller
Author

I used classification error in the paper, but I used log loss and the Brier score to assess temperature scaling in the results I was referring to. I can try to dig them up later if you are interested.

@dholzmueller
Author

Alright, some numbers on the multi-class datasets of our meta-train benchmark (XGB-TD is XGBoost with our meta-learned default parameters):
Arithmetic mean log-loss:

  • XGB-TD + temperature scaling: 0.2527
  • XGB-TD + sklearn isotonic regression: 0.2591
  • XGB-TD stopped on log-loss: 0.2635

Geometric mean (log-loss + 0.01):

  • XGB-TD + temperature scaling: 0.1095
  • XGB-TD stopped on log-loss: 0.1199
  • XGB-TD + sklearn isotonic regression: 0.1217

@adrinjalali
Member

Seems like according to our governance, we need to call for a vote on this one to resolve the issue.

@lorentzenchr
Member

Seems like according to our governance, we need to call for a vote on this one to resolve the issue.

Please wait.

@lorentzenchr
Member

Main Topic

@dholzmueller Thank you so much. This is the first time I have seen the effect of temperature scaling for non-NN models (for classification). Your numbers seem to clearly indicate an improvement. Therefore, as announced, I change my decision to +1 for inclusion.

Implementation

The implementation should use the parametrization with multiplication by an inverse temperature instead of division by the temperature, as proposed in #28574 (comment).

Meta Discussion

I want to stress the following points about the discussion that has happened:

  1. Post-hoc calibration for classifiers is often (for multiclass mostly) based on heuristics. To judge a heuristic recipe, one needs results from experiments. That was what I asked for.
  2. Those experiments were originally done by Guo et al., but only for neural nets. Scikit-learn could state that methodologies and algorithms that are mainly meant to improve (or deal with) NNs are also in scope. I am open to that, but we should then change this FAQ entry.
  3. I find comments about the wide use of a method a quite poor argument for inclusion. Just because everybody does X does not imply that I should do X. Stated differently, "Why don't you do X?" is the wrong question (and might create pressure). Instead, I should ask how to deal with the underlying subject, which obviously seems relevant. Maybe I adopt X, maybe I have a better solution, maybe I just don't do anything about it.

@lorentzenchr added the help wanted, Moderate, and module:calibration labels on Oct 23, 2024