Stacking: Add Ensemble Selection from Libraries of Models #6329

Open · GaelVaroquaux opened this issue Feb 10, 2016 · 26 comments

@GaelVaroquaux
Member

The following paper by Rich Caruana, http://www.niculescu-mizil.org/papers/shotgun.icml04.revised.rev2.pdf, presents a simple greedy algorithm to stack models.

It is reported by many people to be an excellent strategy in data competitions. We should implement it. The paper has only 333 citations, which is on the low side for our criteria, but I hear a lot of good things about it.
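
For context, the core of the paper is forward stepwise selection with replacement over a library of pre-fitted models, scored on a held-out "hillclimb" set. Below is a minimal sketch of that loop; the function name, defaults, and metric convention are illustrative assumptions, not a proposed scikit-learn API.

```python
import numpy as np

def greedy_ensemble_selection(probas, y_true, metric, n_iter=50, n_init=1):
    """Forward stepwise selection with replacement on a hillclimb set.

    probas  : list of (n_samples, n_classes) probability predictions,
              one array per model in the library.
    metric  : callable(y_true, mean_proba) -> score to maximize.
    Returns : per-model selection counts (the ensemble weights).
    """
    counts = np.zeros(len(probas), dtype=int)
    # Seed the ensemble with the n_init best single models.
    singles = [metric(y_true, p) for p in probas]
    for idx in np.argsort(singles)[-n_init:]:
        counts[idx] += 1
    # Greedily add (with replacement) whichever model most improves the
    # score of the averaged probabilities.
    for _ in range(n_iter):
        current = sum(c * p for c, p in zip(counts, probas))
        n = counts.sum()
        gains = [metric(y_true, (current + p) / (n + 1)) for p in probas]
        counts[int(np.argmax(gains))] += 1
    return counts
```

The returned counts act as weights for averaging the selected models' predictions; with a metric such as lambda yt, p: -log_loss(yt, p), the loop maximizes negative log loss.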

@kegl

kegl commented Feb 10, 2016

+1

The method is one of the hidden treasures of ML.

@MechCoder
Member

Interesting. What API do you propose?

Would the constructor take a set of fitted models at init time, and would fit then do the ensemble selection that gives the best score on the data?

@jnothman
Member

Or, if not a set of models as input, a series of (estimator, parameter grid) pairs.

However, training 2000 models under the current single-machine parallelism available in scikit-learn suggests perhaps that allowing pre-trained models is also worthwhile.

I agree that if this is a useful model in practice it is simple enough to be a real boon to our users. It can also be construed as a successful tweak to a generic algorithm that has no shortage of citations.

@MechCoder
Member

However, training 2000 models under the current single-machine parallelism ..

Indeed, which is why I proposed allowing fitted, pre-trained models and doing just the ensemble selection at fit time.

@jnothman
Member

Aha. I'd not understood that from your fist comment. That approach invalidates its use in things like CV (because the fitted models will be wiped on clone, unless we find a way to work around that), and it is quite unusual for an estimator to receive only a validation set as its fit argument, which would make it awkward in a pipeline. I'm not convinced that should be the normative usage as encapsulated by an estimator, though it may make sense as a function along the lines of ensemble_selection(fitted_estimators, test_X, test_y, scoring, ...) which returns a VotingClassifier or AveragedRegressor whose predict, etc. can then be called... This too is a bit awkward in scikit-learn land...
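
To make that function-style idea concrete, here is a hedged sketch; the ensemble_selection signature follows the one suggested above, it reuses the greedy_ensemble_selection sketch from earlier in the thread, and nothing here is an agreed API. Note that the returned VotingClassifier still has to be fitted before use, which is part of the awkwardness just described.

```python
from sklearn.ensemble import VotingClassifier

def ensemble_selection(fitted_estimators, X_val, y_val, metric, n_iter=50):
    """Hypothetical helper: pick ensemble weights on a validation set and
    return a soft-voting ensemble carrying those weights.

    Reuses greedy_ensemble_selection() from the sketch above.
    """
    probas = [est.predict_proba(X_val) for est in fitted_estimators]
    counts = greedy_ensemble_selection(probas, y_val, metric, n_iter=n_iter)
    kept = [(f"model_{i}", est)
            for i, (est, c) in enumerate(zip(fitted_estimators, counts))
            if c > 0]
    weights = [int(c) for c in counts if c > 0]
    # VotingClassifier refits its estimators when fit() is called, which is
    # exactly the mismatch with pre-fitted models discussed here.
    return VotingClassifier(estimators=kept, voting="soft", weights=weights)
```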

@GaelVaroquaux
Member Author

I was thinking more of using it with a few dozen classifiers, for which @kegl has seen good results in practice, rather than 2000 classifiers.

Combined with the new hyper-optimizer in #5491, such a feature would make it easier to build semi-generic learning systems that can adapt to the data that they are given.

@GaelVaroquaux
Member Author

Or, if not a set of models as input, a series of (estimator, parameter grid) pairs.

This does not fit in the scikit-learn workflow, but we have found it useful to tune the hyper-parameters of the various models before giving them to the algorithm mentioned in this issue. It doesn't fit in the workflow because it takes the "best_estimator" of a CV object, rather than the CV object itself. However, I would be in favor of leaving that problem aside for now, coding the algorithm and the object, and advising users to write their own code to do this step.
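
As a user-side illustration of that "tune first, then select" step (a sketch only: the dataset and parameter grids are made up, and ensemble_selection is the hypothetical helper sketched earlier in this thread):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=2000, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

candidates = [
    (LogisticRegression(max_iter=1000), {"C": [0.1, 1.0, 10.0]}),
    (RandomForestClassifier(random_state=0), {"n_estimators": [100, 300]}),
]

# Tune each model separately and keep only best_estimator_, i.e. the
# selection step receives tuned estimators, not the CV objects themselves.
library = [GridSearchCV(est, grid).fit(X_train, y_train).best_estimator_
           for est, grid in candidates]

ensemble = ensemble_selection(library, X_val, y_val,
                              metric=lambda yt, p: -log_loss(yt, p))
```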

@MechCoder
Member

@jnothman

Oh, I just realized that clone recursively removes all "non-hyperparameter" attributes. I agree that this would make it useless in CV etc. It is also clear that the function idea is the most intuitive, except that it is not how we do things in sklearn ;)

What you propose also has its own shortcomings: we would have to do two fits, one for the individual models on the training data and a second for the ensemble on the validation data, and we would have to figure out a way to handle this in the API. (Using splitters would make it even more confusing.)

Which is why just providing the fitted models as parameters seems to me to be the best option, as suggested by Gael. If it feels weird that the new ensemble accepts validation data, note that the paper discusses overfitting to the validation data and uses a bagged ensemble selection to handle it.
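
For reference, that bagged variant is the same greedy selection repeated on random subsets ("bags") of the model library, with the per-model counts summed across bags. A minimal sketch, again reusing the greedy_ensemble_selection helper sketched earlier; parameter names and defaults are illustrative.

```python
import numpy as np

def bagged_ensemble_selection(probas, y_true, metric, n_bags=20,
                              frac=0.5, n_iter=50, random_state=0):
    """Run greedy selection on random bags of the model library and sum
    the per-model counts, to reduce overfitting to the hillclimb set."""
    rng = np.random.RandomState(random_state)
    total = np.zeros(len(probas), dtype=int)
    for _ in range(n_bags):
        bag = rng.choice(len(probas),
                         size=max(1, int(frac * len(probas))),
                         replace=False)
        counts = greedy_ensemble_selection([probas[i] for i in bag],
                                           y_true, metric, n_iter=n_iter)
        total[bag] += counts
    return total
```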

@MechCoder
Member

And this may not be as important as the ensemble classifier itself, but I can't figure out what you mean by a "fist" comment.

@nelson-liu
Contributor

Is anyone working on this? I just read the paper, and it seems quite interesting; I'd love to implement it.

@MechCoder
Member

Sure, go ahead. We can discuss the pros and cons from there.

@Schrodinger1926

@MechCoder I've participated in some data science competitions, and this can make life really easy.
I'd also like to work on this issue.

@yenchenlin
Contributor

Hello @nelson-liu ,
I've already implemented a prototype version of this algorithm and am planning to submit a PR in two days, after I refactor it.
Have you started to work on this?

Really sorry for my late reply.

@Schrodinger1926

@yenchenlin1994 I've just started working on it; I'm quite new to the open source community, so I'm moving a little slowly. I'll wait for your PR and hopefully contribute to it.

@nelson-liu
Contributor

@yenchenlin1994 I worked a bit on it, but that is fine. Go ahead and submit a PR.

@yenchenlin
Contributor

Hello @kegl ,
Can you share which few dozen classifiers you used and which dataset you evaluated on?
I've completed the algorithm and want to test its performance.

@giorgiop
Contributor

giorgiop commented Mar 5, 2016

@yenchenlin1994 did you already open a PR for this?

@yenchenlin
Contributor

Not yet, I'm now writing examples for the docs.
I was busy with my exams last week ...

@jnothman
Member

jnothman commented Mar 5, 2016

Please submit the PR as a WIP.

@x3n0cr4735

Yes, please open the PR. I'm also new to the open source community, but I have had good success with stacking/blending methods and would like to see whether the ensemble method described in Caruana et al. works in practice. I would love to contribute to this.

@yenchenlin
Contributor

Hi all, sorry for the late reply!
I've opened a PR with a code snippet to test the implementation.
Please have a look.

@cmarmo
Contributor

cmarmo commented Dec 10, 2021

@glemaitre you implemented stacking methods in 0.22. Should this issue be closed, together with #6540?
The references are not the same though: Caruana et al. is not cited in the stacking documentation.
Should it be? Thanks for your help.

@GaelVaroquaux
Member Author

No, this is a different method from the stacking that we have in scikit-learn. It is probably complementary, with different usages. In particular, it can select a small number of models from the model library.

@PierrickPochelu

PierrickPochelu commented Oct 20, 2022

I can contribute; I have already implemented a non-scikit-learn parallel version. What is the expected API for this?

I may propose (rough sketch after this list):

  • Constructor: list_of_models; score_function; max_nb_of_models; and diverse configuration options.
  • As input of fit: X and Y (the calibration samples); nb_jobs.
  • As output of fit: the optimized smaller list of models (<= max_nb_of_models).
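
A minimal sketch of what that proposal could look like as a scikit-learn-style estimator. EnsembleSelector, models, score_func, max_models, and n_iter are placeholder names chosen here, not an agreed API; the greedy loop follows the paper's selection-with-replacement idea while capping the number of distinct models kept.

```python
import numpy as np
from sklearn.base import BaseEstimator

class EnsembleSelector(BaseEstimator):
    """Hypothetical estimator: greedily keep at most `max_models` of the
    given pre-fitted models, scored on the calibration data passed to fit."""

    def __init__(self, models=None, score_func=None, max_models=5, n_iter=50):
        self.models = models
        self.score_func = score_func
        self.max_models = max_models
        self.n_iter = n_iter

    def fit(self, X, y):
        probas = [m.predict_proba(X) for m in self.models]
        counts = np.zeros(len(self.models), dtype=int)
        for _ in range(self.n_iter):
            current = sum(c * p for c, p in zip(counts, probas))
            n = counts.sum()
            best, best_score = None, -np.inf
            for i, p in enumerate(probas):
                # Respect the budget: only introduce a new model while fewer
                # than `max_models` distinct models are selected.
                if counts[i] == 0 and np.count_nonzero(counts) >= self.max_models:
                    continue
                score = self.score_func(y, (current + p) / (n + 1))
                if score > best_score:
                    best, best_score = i, score
            counts[best] += 1
        self.selected_models_ = [m for m, c in zip(self.models, counts) if c > 0]
        self.weights_ = counts[counts > 0]
        return self

    def predict_proba(self, X):
        probas = [m.predict_proba(X) for m in self.selected_models_]
        return np.average(probas, axis=0, weights=self.weights_)

    def predict(self, X):
        # Assumes integer class labels 0..n_classes-1, for brevity.
        return np.argmax(self.predict_proba(X), axis=1)
```

Usage would then be roughly selector = EnsembleSelector(models=fitted_models, score_func=..., max_models=5).fit(X_cal, y_cal), with selector.selected_models_ and selector.weights_ holding the reduced library.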

@glemaitre
Member

Regarding the API, it would be best to check the documentation on how to develop a scikit-learn estimator: https://scikit-learn.org/dev/developers/develop.html

@PierrickPochelu

PierrickPochelu commented Oct 24, 2022

I am developing it along with unit tests. I am taking inspiration from the Stacking estimator and yenchenlin's work.
