New example about how to implement the SuperLearner in Python #30398

judithabk6 · 2024-12-03T10:34:20Z

Describe the issue linked to the documentation

The SuperLearner is a stacking strategy that is widely used in fields like statistics (for instance in causal inference, survival analysis, etc.) to obtain a good machine learning model fitted to your data without caring too much about model selection. It is implemented as an R package with good documentation, but it is not available off the shelf in Python, even though it is not very difficult to do with Scikit-Learn.

Suggest a potential alternative/fix

Probably not in the spirit of Scikit-Learn to implement it, but a good example explaining briefly what it is, and how to do it in a nice way in Scikit-Learn could be super helpful!

Happy to help (writing, reviewing, etc.) if needed.

ogrisel · 2024-12-03T10:59:48Z

The main paper was published in 2010 and was cited more than 400 times so would meet our inclusion criteria.

https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=Super+Learner+In+Prediction&btnG=

The algorithm is presented in section 2 of the paper.

It seems very close to our StackingClassifier/Regressor model where the second stage model is a Ridge classifier or regressor model with positivity constraints on the coefficients (and an extra constraint that they should sum to one):

https://scikit-learn.org/stable/modules/ensemble.html#stacked-generalization
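As a rough illustration of the idea described above, here is a minimal sketch of a stacking regressor whose second stage is a convex combination of the base predictions. `ConvexCombinationRegressor` is a hypothetical helper written for this sketch, not part of scikit-learn. It assumes the sum-to-one constraint can be approximated by solving a non-negative least squares problem and then rescaling the weights, and it leaves out the Ridge penalty. The base estimators and the synthetic dataset are arbitrary choices.

```python
import numpy as np
from scipy.optimize import nnls
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import Ridge
from sklearn.neighbors import KNeighborsRegressor


class ConvexCombinationRegressor(RegressorMixin, BaseEstimator):
    """Hypothetical meta-learner: non-negative weights rescaled to sum to one."""

    def fit(self, X, y):
        # X holds the cross-validated predictions of the base estimators,
        # one column per base estimator.
        X = np.asarray(X, dtype=float)
        weights, _ = nnls(X, np.asarray(y, dtype=float))
        total = weights.sum()
        if total > 0:
            self.coef_ = weights / total
        else:
            # Assumption: fall back to uniform weights if NNLS returns zeros.
            self.coef_ = np.full(X.shape[1], 1.0 / X.shape[1])
        return self

    def predict(self, X):
        return np.asarray(X, dtype=float) @ self.coef_


X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)

stack = StackingRegressor(
    estimators=[
        ("ridge", Ridge(alpha=1.0)),
        ("forest", RandomForestRegressor(n_estimators=50, random_state=0)),
        ("knn", KNeighborsRegressor(n_neighbors=5)),
    ],
    final_estimator=ConvexCombinationRegressor(),
    cv=5,  # out-of-fold predictions are used to fit the meta-learner
)
stack.fit(X, y)

# One weight per base estimator, in the order given above.
weights = stack.final_estimator_.coef_
print(dict(zip(["ridge", "forest", "knn"], weights.round(3))))
```

The learned weights can be read as the share of the final prediction attributed to each base estimator, which is one of the things that makes this kind of second stage easy to interpret.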

The main difference with what we currently have in scikit-learn is the preconfigured list of base estimators used to populate the first stage.

Things that we could explore:

improve the existing usage example of stacking in scikit-learn to provide a better list of base models (e.g. mixing MLPs, tree-based models and linear models with non-linear feature engineering, e.g. splines and/or kernel approximators) and illustrate this on a more realistic dataset with a mix of categorical and numerical features;
consider providing a preconfigured list of base estimators/pipelines as part of the library itself;
maybe create a superlearner package under scikit-learn-contrib that tries to mimic the default config and high-level features of the R package?

Our only example is:
https://scikit-learn.org/stable/auto_examples/ensemble/plot_stack_predictors.html

judithabk6 · 2024-12-03T11:06:12Z

Totally agreed that it is not to be implemented in Scikit-Learn. The idea would be to include it in the documentation through an example, either in the existing stacking example or through a new one, with an explicit mention of the SuperLearner paper and/or package in the example (so that it is findable when looking for "super learner python" on your favorite search engine). wdyt?

Your other suggestions are really interesting too, maybe as a second step if needed, depending on how widely the example code ends up being used (not sure how to measure that directly, but communication metrics can help).

ogrisel · 2024-12-03T11:43:29Z

Providing a list of good based linear pipelines might also be useful for #6329 (greedy ensemble) which is very related.

ogrisel · 2024-12-03T11:44:53Z

I also agree that mentioning "SuperLearner" in the docstring of the stacking meta-estimators, in the user guide, in the example, or in all of the above might be helpful for googleability.

judithabk6 added the Documentation and Needs Triage labels Dec 3, 2024

ogrisel added the Needs Decision - Include Feature label and removed the Needs Triage label Dec 3, 2024
