New example about how to implement the SuperLearner in Python #30398

Open
judithabk6 opened this issue Dec 3, 2024 · 4 comments
Labels
Documentation Needs Decision - Include Feature Requires decision regarding including feature

Comments

@judithabk6
Contributor

Describe the issue linked to the documentation

The SuperLearner is a stacking strategy that is widely used in fields such as statistics (for instance in causal inference and survival analysis) to obtain a good machine learning model fitted to your data without worrying too much about model selection. It is implemented as an R package with good documentation, but it is not available off the shelf in Python, even though it is not very difficult to build with Scikit-Learn.

Suggest a potential alternative/fix

It is probably not in the spirit of Scikit-Learn to implement it, but a good example briefly explaining what it is and how to do it cleanly in Scikit-Learn could be super helpful!

Happy to help (writing, reviewing, etc.) if needed.

@judithabk6 judithabk6 added Documentation Needs Triage Issue requires triage labels Dec 3, 2024
@ogrisel
Member

ogrisel commented Dec 3, 2024

The main paper was published in 2010 and has been cited more than 400 times, so it would meet our inclusion criteria.

https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=Super+Learner+In+Prediction&btnG=

The algorithm is presented in section 2 of the paper:

[Image: the algorithm as presented in section 2 of the paper]

It seems very close to our StackingClassifier/StackingRegressor models where the second-stage model is a Ridge classifier or regressor with positivity constraints on the coefficients (and an extra constraint that they sum to one):

https://scikit-learn.org/stable/modules/ensemble.html#stacked-generalization
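
As a rough illustration of the second-stage idea described above (non-negative coefficients that sum to one), here is a minimal sketch for regression. It is not the R package's implementation: the helper names `fit_super_learner` and `predict_super_learner` are hypothetical, the three base learners are arbitrary choices, and the sum-to-one constraint is approximated by normalizing the non-negative least squares solution rather than by solving the constrained problem exactly.

```python
import numpy as np
from scipy.optimize import nnls
from sklearn.base import clone
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.neighbors import KNeighborsRegressor


def fit_super_learner(base_learners, X, y, n_splits=5, random_state=0):
    """Hypothetical helper: return (fitted_learners, weights).

    The weights are non-negative and normalized to sum to one.
    """
    cv = KFold(n_splits=n_splits, shuffle=True, random_state=random_state)
    # Out-of-fold predictions, one column per base learner.
    Z = np.column_stack(
        [cross_val_predict(clone(est), X, y, cv=cv) for est in base_learners]
    )
    # Non-negative least squares on the out-of-fold predictions.
    weights, _ = nnls(Z, y)
    if weights.sum() == 0:
        # Degenerate case: fall back to a plain average.
        weights = np.full(len(base_learners), 1.0 / len(base_learners))
    else:
        # Assumption: normalize afterwards instead of constraining the fit.
        weights = weights / weights.sum()
    # Refit every base learner on the full training set.
    fitted = [clone(est).fit(X, y) for est in base_learners]
    return fitted, weights


def predict_super_learner(fitted, weights, X):
    """Hypothetical helper: weighted average of the base predictions."""
    preds = np.column_stack([est.predict(X) for est in fitted])
    return preds @ weights


X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)
learners = [
    Ridge(alpha=1.0),
    RandomForestRegressor(n_estimators=50, random_state=0),
    KNeighborsRegressor(),
]
fitted, weights = fit_super_learner(learners, X, y)
y_pred = predict_super_learner(fitted, weights, X)
print(dict(zip(["ridge", "random_forest", "knn"], weights.round(3))))
```

A similar effect without the normalization step can probably be obtained with `StackingRegressor(final_estimator=LinearRegression(positive=True))`, which keeps the coefficients non-negative but does not force them to sum to one.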

The main difference with what we currently have in scikit-learn is the preconfigured list of base estimators used to populate the first stage.

Things that we could explore:

  • improve the existing usage example of stacking in scikit-learn to provide a better list of base models (e.g. mixing MLPs, tree-based models and linear models with non-linear feature engineering, e.g. splines and/or kernel approximators) and illustrate this on a more realistic dataset with a mix of categorical and numerical features;
  • consider providing a preconfigured list of base estimators/pipelines as part of the library itself;
  • maybe create a superlearner package under scikit-learn-contrib that tries to mimic the default config and high-level features of the R package?

Our only example is:
https://scikit-learn.org/stable/auto_examples/ensemble/plot_stack_predictors.html

@ogrisel ogrisel added Needs Decision - Include Feature Requires decision regarding including feature and removed Needs Triage Issue requires triage labels Dec 3, 2024
@judithabk6
Contributor Author

Totally agreed that it should not be implemented in Scikit-Learn. The idea would be to include it in the documentation through an example, either in the existing stacking example or in a new one, with an explicit mention of the SuperLearner paper and/or package (so that it is findable when searching for "super learner python" on your favorite search engine). WDYT?

Your other suggestions are really interesting too, maybe as a second step if needed, depending on how widely the example code ends up being used (not sure how to measure that directly, but communication metrics can help).

@ogrisel
Member

ogrisel commented Dec 3, 2024

Providing a list of good base linear pipelines might also be useful for #6329 (greedy ensemble), which is closely related.

@ogrisel
Member

ogrisel commented Dec 3, 2024

I also agree that mentioning "SuperLearner" in the docstring of the stacking meta-estimators, in the user guide, in the example, or in all of the above might be helpful for googleability.
