Stacking: Add Ensemble Selection from Libraries of Models #6329

Open · GaelVaroquaux opened this issue Feb 10, 2016 · 26 comments

@GaelVaroquaux
Member

The following paper by Rich Caruana, http://www.niculescu-mizil.org/papers/shotgun.icml04.revised.rev2.pdf, presents a simple greedy algorithm to stack models.

It is reported by many people to be an excellent strategy in data competitions. We should implement it. The paper has only 333 citations, which is on the low side for our criteria, but I hear a lot of good things about it.
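
For context, the core of the paper is forward stepwise selection with replacement over a library of pre-fitted models, scored on a held-out "hillclimb" set. Below is a minimal sketch of that loop; the function name, defaults, and metric convention are illustrative assumptions, not a proposed scikit-learn API.

```python
import numpy as np

def greedy_ensemble_selection(probas, y_true, metric, n_iter=50, n_init=1):
    """Forward stepwise selection with replacement on a hillclimb set.

    probas  : list of (n_samples, n_classes) probability predictions,
              one array per model in the library.
    metric  : callable(y_true, mean_proba) -> score to maximize.
    Returns : per-model selection counts (the ensemble weights).
    """
    counts = np.zeros(len(probas), dtype=int)
    # Seed the ensemble with the n_init best single models.
    singles = [metric(y_true, p) for p in probas]
    for idx in np.argsort(singles)[-n_init:]:
        counts[idx] += 1
    # Greedily add (with replacement) whichever model most improves the
    # score of the averaged probabilities.
    for _ in range(n_iter):
        current = sum(c * p for c, p in zip(counts, probas))
        n = counts.sum()
        gains = [metric(y_true, (current + p) / (n + 1)) for p in probas]
        counts[int(np.argmax(gains))] += 1
    return counts
```

The returned counts act as weights for averaging the selected models' predictions; with a metric such as lambda yt, p: -log_loss(yt, p), the loop maximizes negative log loss.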

@kegl

kegl commented Feb 10, 2016

+1

The method is one of the hidden treasures of ML.

@MechCoder
Member

Interesting. What API do you propose?

Would the constructor take a set of fitted models at init time, and would fit then do the ensemble selection that gives the best score on the data?

@jnothman
Member

Or, if not a set of models as input, a series of (estimator, parameter grid) pairs.

However, training 2000 models under the current single-machine parallelism available in scikit-learn suggests perhaps that allowing pre-trained models is also worthwhile.

I agree that if this is a useful model in practice it is simple enough to be a real boon to our users. It can also be construed as a successful tweak to a generic algorithm that has no shortage of citations.

@MechCoder
Member

However, training 2000 models under the current single-machine parallelism ..

Indeed, which is why I proposed allowing fitted, pre-trained models and doing just the ensemble selection at fit time.

@jnothman
Member

Aha. I'd not understood that from your fist comment. That approach invalidates its use in things like CV (because the fitted models will be wiped on clone, unless we find a way to work around that), and it is quite unusual for an estimator to receive only a validation set as its fit argument, which would make it awkward in a pipeline. I'm not convinced that should be the normative usage as encapsulated by an estimator, though it may make sense as a function along the lines of ensemble_selection(fitted_estimators, test_X, test_y, scoring, ...) which returns a VotingClassifier or AveragedRegressor whose predict, etc. can then be called... This too is a bit awkward in scikit-learn land...
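
To make that function-style idea concrete, here is a hedged sketch; the ensemble_selection signature follows the one suggested above, it reuses the greedy_ensemble_selection sketch from earlier in the thread, and nothing here is an agreed API. Note that the returned VotingClassifier still has to be fitted before use, which is part of the awkwardness just described.

```python
from sklearn.ensemble import VotingClassifier

def ensemble_selection(fitted_estimators, X_val, y_val, metric, n_iter=50):
    """Hypothetical helper: pick ensemble weights on a validation set and
    return a soft-voting ensemble carrying those weights.

    Reuses greedy_ensemble_selection() from the sketch above.
    """
    probas = [est.predict_proba(X_val) for est in fitted_estimators]
    counts = greedy_ensemble_selection(probas, y_val, metric, n_iter=n_iter)
    kept = [(f"model_{i}", est)
            for i, (est, c) in enumerate(zip(fitted_estimators, counts))
            if c > 0]
    weights = [int(c) for c in counts if c > 0]
    # VotingClassifier refits its estimators when fit() is called, which is
    # exactly the mismatch with pre-fitted models discussed here.
    return VotingClassifier(estimators=kept, voting="soft", weights=weights)
```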

@GaelVaroquaux
Member Author

I was thinking more of using it with a few dozen classifiers, for which @kegl has seen good results in practice, rather than 2000 classifiers.

Combined with the new hyper-optimizer in #5491, such a feature would make it easier to build semi-generic learning systems that can adapt to the data that they are given.

@GaelVaroquaux
Member Author

Or, if not a set of models as input, a series of (estimator, parameter grid) pairs.

This does not fit in the scikit-learn workflow, but we have found it useful to tune the hyper-parameters of the various models before giving them to the algorithm mentioned in this issue. It doesn't fit in the workflow because it takes the "best_estimator" of a CV object, rather than the CV object itself. However, I would be in favor of leaving that problem aside for now, coding the algorithm and the object, and advising users to write their own code to do this step.
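
As a user-side illustration of that "tune first, then select" step (a sketch only: the dataset and parameter grids are made up, and ensemble_selection is the hypothetical helper sketched earlier in this thread):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=2000, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

candidates = [
    (LogisticRegression(max_iter=1000), {"C": [0.1, 1.0, 10.0]}),
    (RandomForestClassifier(random_state=0), {"n_estimators": [100, 300]}),
]

# Tune each model separately and keep only best_estimator_, i.e. the
# selection step receives tuned estimators, not the CV objects themselves.
library = [GridSearchCV(est, grid).fit(X_train, y_train).best_estimator_
           for est, grid in candidates]

ensemble = ensemble_selection(library, X_val, y_val,
                              metric=lambda yt, p: -log_loss(yt, p))
```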

@MechCoder
Member

@jnothman

Oh, I just realized that clone recursively removes all "non-hyperparameter" attributes. I agree that this would make it useless in CV etc. It is also clear that the function idea is the most intuitive, except that it is not how we do things in sklearn ;)

What you propose also has its own shortcomings: we would have to do two fits, one for the individual models on the training data and a second for the ensemble on the validation data, and we would have to figure out a way to handle this in the API. (Using splitters would make it even more confusing.)

Which is why just providing the fitted models as parameters seems to me to be the best option, as suggested by Gael. If it feels weird that the new ensemble accepts validation data, note that the paper discusses overfitting to the validation data and uses a bagged ensemble selection to handle it.
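
For reference, that bagged variant is the same greedy selection repeated on random subsets ("bags") of the model library, with the per-model counts summed across bags. A minimal sketch, again reusing the greedy_ensemble_selection helper sketched earlier; parameter names and defaults are illustrative.

```python
import numpy as np

def bagged_ensemble_selection(probas, y_true, metric, n_bags=20,
                              frac=0.5, n_iter=50, random_state=0):
    """Run greedy selection on random bags of the model library and sum
    the per-model counts, to reduce overfitting to the hillclimb set."""
    rng = np.random.RandomState(random_state)
    total = np.zeros(len(probas), dtype=int)
    for _ in range(n_bags):
        bag = rng.choice(len(probas),
                         size=max(1, int(frac * len(probas))),
                         replace=False)
        counts = greedy_ensemble_selection([probas[i] for i in bag],
                                           y_true, metric, n_iter=n_iter)
        total[bag] += counts
    return total
```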

@MechCoder
Member

And this may not be as important as the ensemble classifier itself, but I can't figure out what you mean by a "fist" comment.

@nelson-liu
Contributor

Is anyone working on this? I just read the paper, and it seems quite interesting; I'd love to implement it.

@MechCoder
Member

Sure, go ahead. We can discuss the pros and cons from there.

@Schrodinger1926

@MechCoder I've participated in some data science competitions, and this can make life really easy.
I'd also like to work on this issue.

@yenchenlin
Contributor

Hello @nelson-liu ,
I've already implemented a prototype version of this algorithm and am planning to submit a PR in two days, after I refactor it.
Have you started to work on this?

Really sorry for my late reply.

@Schrodinger1926

@yenchenlin1994 I've just started working on it; I'm quite new to the open source community, so I'm moving a little slowly. I'll wait for your PR and hopefully contribute to it.

@nelson-liu
Contributor

@yenchenlin1994 I worked a bit on it, but that is fine. Go ahead and submit a PR.

@yenchenlin
Contributor

Hello @kegl ,
Can you share which few dozen classifiers you used and which dataset you evaluated on?
I've completed the algorithm and want to test its performance.

@giorgiop
Contributor

giorgiop commented Mar 5, 2016

@yenchenlin1994 did you already open a PR for this?

@yenchenlin
Contributor

Not yet, I'm now writing examples for the docs.
I was busy with my exams last week ...

@jnothman
Member

jnothman commented Mar 5, 2016

Please submit the PR as a WIP.

@x3n0cr4735

Yes, please open the PR. I'm also new to the open source community, but I have had good success with stacking/blending methods and would like to see whether the ensemble method described in Caruana et al. works in practice. I would love to contribute to this.

@yenchenlin
Contributor

Hi all, sorry for the late reply!
I've opened a PR with a code snippet to test the implementation.
Please have a look.

@cmarmo
Contributor

cmarmo commented Dec 10, 2021

@glemaitre you implemented stacking methods in 0.22. Should this issue be closed, together with #6540?
The references are not the same though: Caruana et al. is not cited in the stacking documentation.
Should it be? Thanks for your help.

@GaelVaroquaux
Member Author

No, this is a different method from the stacking that we have in scikit-learn. It is probably complementary, with different usages. In particular, it can select a small number of models from the model library.

@PierrickPochelu

PierrickPochelu commented Oct 20, 2022

I can contribute; I have already implemented a non-scikit-learn parallel version. What is the expected API for this?

I may propose (rough sketch after this list):

  • Constructor: list_of_models; score_function; max_nb_of_models; and diverse configuration options.
  • As input of fit: X and Y (the calibration samples); nb_jobs.
  • As output of fit: the optimized smaller list of models (<= max_nb_of_models).
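
A minimal sketch of what that proposal could look like as a scikit-learn-style estimator. EnsembleSelector, models, score_func, max_models, and n_iter are placeholder names chosen here, not an agreed API; the greedy loop follows the paper's selection-with-replacement idea while capping the number of distinct models kept.

```python
import numpy as np
from sklearn.base import BaseEstimator

class EnsembleSelector(BaseEstimator):
    """Hypothetical estimator: greedily keep at most `max_models` of the
    given pre-fitted models, scored on the calibration data passed to fit."""

    def __init__(self, models=None, score_func=None, max_models=5, n_iter=50):
        self.models = models
        self.score_func = score_func
        self.max_models = max_models
        self.n_iter = n_iter

    def fit(self, X, y):
        probas = [m.predict_proba(X) for m in self.models]
        counts = np.zeros(len(self.models), dtype=int)
        for _ in range(self.n_iter):
            current = sum(c * p for c, p in zip(counts, probas))
            n = counts.sum()
            best, best_score = None, -np.inf
            for i, p in enumerate(probas):
                # Respect the budget: only introduce a new model while fewer
                # than `max_models` distinct models are selected.
                if counts[i] == 0 and np.count_nonzero(counts) >= self.max_models:
                    continue
                score = self.score_func(y, (current + p) / (n + 1))
                if score > best_score:
                    best, best_score = i, score
            counts[best] += 1
        self.selected_models_ = [m for m, c in zip(self.models, counts) if c > 0]
        self.weights_ = counts[counts > 0]
        return self

    def predict_proba(self, X):
        probas = [m.predict_proba(X) for m in self.selected_models_]
        return np.average(probas, axis=0, weights=self.weights_)

    def predict(self, X):
        # Assumes integer class labels 0..n_classes-1, for brevity.
        return np.argmax(self.predict_proba(X), axis=1)
```

Usage would then be roughly selector = EnsembleSelector(models=fitted_models, score_func=..., max_models=5).fit(X_cal, y_cal), with selector.selected_models_ and selector.weights_ holding the reduced library.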

@glemaitre
Member

Regarding the API, it would be best to check the documentation on how to develop a scikit-learn estimator: https://scikit-learn.org/dev/developers/develop.html

@PierrickPochelu

PierrickPochelu commented Oct 24, 2022

I am developing it along with unit tests. I am taking inspiration from the Stacking estimator and yenchenlin's work.
