Stacking: Add Ensemble Selection from Libraries of Models #6329
Comments
+1 The method is one of the hidden treasures of ML.
Interesting. What API do you propose? The constructor will take a set of fit models during construction?
Or, if not a set of models as input, a series of (estimator, parameters) pairs. However, training 2000 models under the current single-machine parallelism would be very costly. I agree that if this is a useful model in practice, it is simple enough to implement.
Indeed, which is why I proposed allowing fitted (pre-trained) models and doing just the ensemble selection at fit time.
Aha, I'd not understood that from your first comment. That approach invalidates its use in things like CV (because the fit models will be wiped on clone, unless we find a way to work around that), and it is quite unusual for it to receive only a validation set as its fit argument, which would make it awkward in a pipeline. I'm not convinced that should be the normative usage as encapsulated by an estimator, though it may make sense as a function rather than an estimator.
I was more thinking of using it with a few dozen classifiers, for which @kegl has seen good usage in practice, rather than 2000 classifiers. Combined with the new hyperparameter optimizer in #5491, such a feature would make it easier to build semi-generic learning systems that can adapt to the data that they are given.
This does not fit in the scikit-learn workflow, but we have found it useful in practice.
Oh, I just realized that clone recursively removes all "non-hyperparameter" attributes. I agree that this would make it useless in CV etc. It is also clear that the function idea is the most intuitive, except that is not how we do things in sklearn ;)

What you propose also has its own shortcomings: we would have to do two fits, one of the individual models on the training data and a second of the ensemble on the validation data, and we would have to figure out a way in the API to handle this. (Using splitters would make it even more confusing.) This is why just providing the fit models as parameters seems to me to be the best option, as suggested by Gael.

If it feels weird for the new ensemble to accept validation data, note that the paper speaks about overfitting on the validation data and uses a bagged ensemble to handle this.
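To make that bagging idea concrete, here is a rough sketch (all names are illustrative; `select_fn` stands in for the greedy selection loop from the paper, sketched at the end of this thread):

```python
import numpy as np

def bagged_selection(library_preds, y_val, select_fn, n_bags=20, frac=0.5, seed=0):
    """Run greedy selection on random "bags" of the model library and pool
    the picks; averaging over bags reduces overfitting to the validation set.

    select_fn(bag_preds, y_val) -> list of indices into bag_preds.
    """
    rng = np.random.default_rng(seed)
    n_models = len(library_preds)
    counts = np.zeros(n_models)
    for _ in range(n_bags):
        # Draw a random subset of the library and run selection on it only.
        bag = rng.choice(n_models, size=max(1, int(frac * n_models)), replace=False)
        chosen = select_fn([library_preds[i] for i in bag], y_val)
        for j in chosen:
            counts[bag[j]] += 1
    return counts / counts.sum()  # per-model ensemble weights
```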
And this may not be as important as the ensemble classifier itself, but I can't figure out what you meant in your first comment.
Is anyone working on this? I just read the paper, and it seems quite interesting; I'd love to implement it.
Sure, go ahead. We can discuss the pros and cons from there.
@MechCoder I've participated in some data science competitions; this can make life really easy.
Hello @nelson-liu, really sorry for my late reply.
@yenchenlin1994 I've just started working on it; I'm quite new to the open source community, so I'm moving a little slowly. I'll wait for your PR and hopefully contribute to it.
@yenchenlin1994 I worked a bit on it, but that is fine. Go ahead and submit a PR.
Hello @kegl,
@yenchenlin1994 did you already open a PR for this?
Not yet; I'm now writing examples for the docs.
Please submit the PR as a WIP.
Yes, please open the PR. I'm also new to the open source community but have had good success with stacking/blending methods and would like to see if an ensemble method as described in Caruana et al. will work in practice. Would love to contribute to this.
Hi all, sorry for the late reply!
@glemaitre you implemented stacking methods in 0.22. Should this issue be closed? Together with #6540?
No, this is a different method from the stacking that we have in scikit-learn. Such a method is probably complementary, with different usages. In particular, it can select a small number of models from the model library.
I can contribute; I have already implemented a non-scikit-learn parallel version. What is the expected API for this? I may propose something along the lines of the sketch below:
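(Every name here is hypothetical, and the selection loop is only a compact illustration of the paper's greedy procedure, assuming classes are encoded 0..n_classes-1; note also the clone caveat discussed above, since pre-fitted models are stored as a constructor parameter.)

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.metrics import accuracy_score

class EnsembleSelectionClassifier(BaseEstimator, ClassifierMixin):
    """Greedy ensemble selection (Caruana et al., 2004) over pre-fit models."""

    def __init__(self, fitted_estimators, n_iter=50):
        self.fitted_estimators = fitted_estimators
        self.n_iter = n_iter

    def fit(self, X_val, y_val):
        # Only the selection happens here; the library was fitted elsewhere,
        # typically on a separate training split.
        preds = [est.predict_proba(X_val) for est in self.fitted_estimators]
        selected, running_sum = [], np.zeros_like(preds[0])
        for _ in range(self.n_iter):
            # Add (with replacement) the model that most improves the
            # accuracy of the averaged ensemble; assumes labels 0..n-1.
            scores = [
                accuracy_score(
                    y_val,
                    ((running_sum + p) / (len(selected) + 1)).argmax(axis=1),
                )
                for p in preds
            ]
            best = int(np.argmax(scores))
            selected.append(best)
            running_sum += preds[best]
        self.selected_ = selected
        return self

    def predict(self, X):
        probas = [self.fitted_estimators[i].predict_proba(X) for i in self.selected_]
        return np.asarray(probas).mean(axis=0).argmax(axis=1)
```

Usage would be along the lines of fitting a library of diverse models on a training split, then calling fit with a held-out validation split that drives the selection.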
Regarding the API, it would be easier to check the documentation on how to develop a scikit-learn estimator: https://scikit-learn.org/dev/developers/develop.html
I am developing it along with the unit tests, taking inspiration from the Stacking estimator and yenchenlin's work.
The following paper by Rich Caruana
http://www.niculescu-mizil.org/papers/shotgun.icml04.revised.rev2.pdf
presents a simple greedy algorithm to stack models.
It is reported by many people to be an excellent strategy in data competitions. We should implement it. The paper has only 333 citations, which is on the low side for our criteria, but I hear a lot of good about it.
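For reference, the greedy procedure the paper describes can be sketched as follows (accuracy is used here for concreteness, while the paper optimizes arbitrary metrics; all names are illustrative):

```python
import numpy as np
from sklearn.metrics import accuracy_score

def ensemble_select(library_preds, y_val, n_iter=50):
    """Greedily grow an ensemble from a library of fitted models.

    library_preds: one array of validation-set class probabilities per model.
    Returns the indices of the selected models; repeats are allowed, which
    implicitly weights models by how often they are chosen.
    """
    selected = []
    running_sum = np.zeros_like(library_preds[0])
    for _ in range(n_iter):
        # Score each candidate by the validation accuracy of the averaged
        # ensemble that would result from adding it (with replacement).
        def score_with(p):
            avg = (running_sum + p) / (len(selected) + 1)
            return accuracy_score(y_val, avg.argmax(axis=1))
        best = max(range(len(library_preds)), key=lambda i: score_with(library_preds[i]))
        selected.append(best)
        running_sum += library_preds[best]
    return selected
```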