Conversation

@adrinjalali
Member

Towards #29893

This deprecates CalibratedClassifierCV(..., cv="prefit") in favor of CalibratedClassifierCV(FrozenEstimator(est)).

Also adds ensemble="auto" option for automatically handling FrozenEstimator correctly.

cc @glemaitre @ogrisel

@github-actions

github-actions bot commented Oct 29, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 4c17d22.

@adrinjalali adrinjalali added this to the 1.6 milestone Oct 29, 2024
Member

@adam2392 adam2392 left a comment


Looks simple enough to me, but I'm by no means an expert in this area of the codebase :p

Member

@ogrisel ogrisel left a comment


I have mixed feelings about this: asking the user to import a class and wrap an argument feels like a small usability regression.

What about not deprecating cv="prefit" and automatically do the wrapping with FrozenEstimator on the fly in _get_estimator:

    def _get_estimator(self):
        """Resolve which estimator to return (default is LinearSVC)."""
        if self.cv == "prefit" and self.estimator is None:
            raise ValueError(
                'estimator cannot be left to None when cv="prefit"'
            )
        if self.estimator is None:
            # We want all classifiers that don't expose a random_state
            # to be deterministic (and we don't want to expose this one).
            estimator = LinearSVC(random_state=0)
            if _routing_enabled():
                estimator.set_fit_request(sample_weight=True)
        else:
            estimator = self.estimator
            if self.cv == "prefit":
                # Wrap the prefitted estimator on the fly so it is
                # never refit downstream.
                estimator = FrozenEstimator(estimator)
        return estimator

But then we would have two equivalent ways to do the same thing:

from sklearn.frozen import FrozenEstimator

CalibratedClassifierCV(FrozenEstimator(model)).fit(X_cal, y_cal)

and:

CalibratedClassifierCV(model, cv="prefit").fit(X_cal, y_cal)

But maybe that's ok?

@adrinjalali
Member Author

I wouldn't say it's a usability regression. cv="prefit" doesn't make any sense semantically. It's just that we didn't have any other place to put "please don't fit my estimator", so we put it under cv.

@ogrisel
Member

ogrisel commented Oct 30, 2024

cv="prefit" doesn't make any sense semantically

I partially agree... But at the same time, calling cross_val_predict on a frozen estimator is just a complex way to do estimator.predict(X_cal) when cv is a CV splitter with non-overlapping and full-coverage validation folds as with the default cv=5.

In that respect, cv="prefit" is a usability improvement in the sense that it's simpler to understand compared to inferring the meaning and statistical implications of a call such as CalibratedClassifierCV(FrozenEstimator(model), cv=5).

Member

@ogrisel ogrisel left a comment


If @thomasjpfan and others think it's better, and I am in the minority, then so be it. It's true that this will also allow a small code simplification once the deprecation period is over.

    X, y = dict_data
    clf = dict_data_pipeline
    calib_clf = CalibratedClassifierCV(clf, cv="prefit")
    calib_clf = CalibratedClassifierCV(FrozenEstimator(clf), cv=2)
Member


This change highlights another small usability regression: it's difficult to understand why we would need to adjust the CV strategy when calibrating a frozen estimator on a dataset with fewer than 5 data points per class. One could expect that cross-fitting would be entirely disabled in that case.

Again, I can live with it.

Member Author


I'd agree with you, but in reality, people with 5 data points shouldn't be doing statistical ML 😁

Member

@ogrisel ogrisel Oct 30, 2024


In particular, meaningful calibration with 5 data points per class is quite challenging. Still, it reveals a rough edge in my opinion. But I believe we can fix it later if needed.

@ogrisel ogrisel merged commit b4eef25 into scikit-learn:main Oct 30, 2024
25 checks passed
@adrinjalali adrinjalali deleted the calibration/cv branch October 31, 2024 10:03