SLEP005: Resampler API #15
Conversation
I was expecting "the fit_resample slep" to be about imbalanced data. I think subsampling is a much more common issue than trying to automatically remove outliers. Do you have a real-world application where this would help? ;)
Indeed, as I was telling Guillaume IRL, I think that this SLEP should be about fit_resample, rather than outlier rejection, and outlier rejection should be listed as an application. I am not convinced that proposing a new mixin calls for a SLEP in itself; it's the addition of the method that does.
OK, so I will modify the SLEP to make it about resamplers. Basically, I should keep the part about the pipeline implementation and its limitations. @orausch made a good suggestion.
@amueller This is indeed a really good point, which should be demonstrated in the outlier PR.
Will resamplers be applied in `transform`?
Nope, resamplers will only be implementing `fit_resample`.
I wonder because of this: 4ecc51b#diff-fa5683f6b9de871ce2af02905affbdaaR80
Sorry, that is a mistake on my part. They should not be applied on transform. EDIT: and accordingly, during …
Yes, this makes sense.
Currently, my PR applies resamplers on …

To get the behavior described above, a naive implementation would have to call each transformer after the first resampler twice: once for the fit path, where we apply resamplers, and once for the transform path, where we don't. It seems to me that, in order to do it efficiently, we would need some knowledge of which samples were added/removed by the resamplers in the pipeline. If we want to make …

This brings me to the next thought: does it even make sense to have resamplers in a transformer-only pipeline? Is there a good use case? One choice would be to simply disallow this behavior (similarly to how we disallow resamplers for …).
So I updated the SLEP toward a more …

I realised (even if it is obvious) that outlier rejection is unsupervised, while resampling for balancing is supervised (and for binary/multiclass classification), AFAIK. In the latter case, resamplers will need to validate the targets and define an API for driving the resampling (i.e. sampling_strategy in imblearn). Should this API choice be discussed within the SLEP as well, or is it more specific to one type of resampler and better handled later on?
https://github.com/scikit-learn-contrib/imbalanced-learn/blob/master/imblearn/pipeline.py#L487

We are skipping the resampler during …
@orausch I am a bit confused by your comment. When calling …, … For …, … Could you explain where exactly you think we are applying the resampling? Maybe I am missing something.
Sorry, I missed that. I think there is a problem regardless.

If we do this, we have inconsistent behavior between `fit_transform(X, y)` and `fit(X, y).transform(X)`:

```python
X = [A, B, C]  # A, B, C are feature vectors
y = [a, b, c]
pipe = make_pipeline(removeB, mult2)  # removeB is a resampler that will remove B, mult2 is a transformer
```

Then `pipe.fit_transform(X, y)` applies the resampler and returns two transformed samples, while `pipe.fit(X, y).transform(X)` skips it and returns three.
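To make the inconsistency concrete, here is a small self-contained sketch simulating the two code paths by hand in plain NumPy; `remove_b` and `mult2` are toy stand-ins, not scikit-learn objects:

```python
import numpy as np

X = np.array([[1.0], [2.0], [3.0]])   # samples A, B, C
y = np.array([0, 1, 0])               # targets a, b, c

def remove_b(X, y):                   # stand-in for the resampler's fit_resample
    keep = np.array([True, False, True])
    return X[keep], y[keep]

def mult2(X):                         # stand-in for the transformer
    return 2 * X

# fit path: the resampler runs, so downstream steps see 2 samples
X_res, y_res = remove_b(X, y)
print(mult2(X_res))                   # [[2.], [6.]]        -> 2 samples

# transform path: the resampler is skipped, so all 3 samples come through
print(mult2(X))                       # [[2.], [4.], [6.]]  -> 3 samples
```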
At least in the imbalanced-setting use case, you will usually have either a classifier or a resampler as the last step, not a transformer. I suppose the same holds for outlier rejection. In your example it makes sense to resample also in the pipeline's transform, right? But what if you transform again with a …
Some options to address this:

…

Or we can implement the behavior that, for …

Let me know if I missed something.
@amueller @jnothman @adrinjalali @GaelVaroquaux @agramfort Any thoughts about the issue raised by @orausch in #15 (comment)?
I don't think there's one solution that works for all use cases; here are two real-world examples (and I hope they're convincing):

1. In the context of FAT-ML, assume the following pipeline:
   - resample to tackle class imbalance
   - mutate the data for the purpose of more "fair" data (which may or may not touch y)
   - usual transformers
   - an estimator

   In the above pipeline, during fit, I'd like the first two steps to be on, and during predict, I'd like them to be off, which we can do if the first two steps are resamplers and they're automatically left out during predict.

2. I get periodic data from a bunch of sensors installed in a factory, and I need to do predictive maintenance. Sometimes the data is clearly off the charts and I know from domain knowledge that I can and should safely ignore it. The pipeline would look like:
   - exclude outliers
   - usual transformers
   - predict faulty behavior

   In contrast to the previous use case, now I'd like the first step to always be on, because I need to exclude those data points from my analysis.

Also, we should consider the fact that once we have this third type of model, we'd have at least three types of pipelines, i.e. estimator, transformer, and resampler pipelines. I feel like this fact, plus the above two use cases, would justify a parameter like resample_on_transform for the pipeline, to tune the behavior of the pipeline regarding the resamplers among its steps. I'm not sure if it completely solves our issues, but it may.

For instance, if the user wants to mix these behaviors, they can put different resamplers into different pipelines, set the resample_on_transform of each pipeline appropriately, and include those pipelines in their main pipeline.

I haven't managed to completely think these different cases through, but I hope I could convey the idea.
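For concreteness, a toy sketch of how a `resample_on_transform` switch might behave. None of these classes or the parameter exist in scikit-learn; the names (`ToyPipeline`, `DropExtremes`, `Scale2`) are made up purely to illustrate the proposed semantics:

```python
import numpy as np

class DropExtremes:
    """Toy resampler: drops samples whose first feature exceeds a cutoff."""
    def __init__(self, cutoff=100.0):
        self.cutoff = cutoff

    def fit_resample(self, X, y):
        keep = X[:, 0] <= self.cutoff
        return X[keep], y[keep]

class Scale2:
    """Toy transformer: multiplies features by 2."""
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return 2 * X

class ToyPipeline:
    """Toy pipeline: steps exposing fit_resample are treated as resamplers."""
    def __init__(self, steps, resample_on_transform=False):
        self.steps = steps
        self.resample_on_transform = resample_on_transform

    def fit_transform(self, X, y):
        for step in self.steps:
            if hasattr(step, "fit_resample"):
                X, y = step.fit_resample(X, y)   # resamplers always run during fit
            else:
                X = step.fit(X, y).transform(X)
        return X, y

    def transform(self, X, y=None):
        for step in self.steps:
            if hasattr(step, "fit_resample"):
                if self.resample_on_transform:   # use case 2: keep dropping outliers
                    X, y = step.fit_resample(X, np.zeros(len(X)) if y is None else y)
                # use case 1: silently skip the resampler at transform time
            else:
                X = step.transform(X)
        return X

X = np.array([[1.0], [2.0], [500.0]])
y = np.array([0, 1, 0])
pipe = ToyPipeline([DropExtremes(), Scale2()], resample_on_transform=False)
print(pipe.fit_transform(X, y))   # outlier dropped during fit: 2 samples
print(pipe.transform(X))          # outlier kept at transform time: 3 samples
```

With `resample_on_transform=True`, the outlier would also be dropped by `transform`, matching the second use case above.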
I'm not sure I buy the second use case. Why is the outlier removal in the pipeline? I would assume it's some simple heuristic like too big a change from yesterday or a value way out of range. I haven't heard of anyone using outlier detection models in practice for something like that.

I guess you could argue that estimating the range should probably be done on the training set only in cross-validation. Though I would argue that in this use case you want a human to review the ranges and not determine them automatically.

Actually, a use case that I think is more realistic is a user tagging obvious outliers, so it's a supervised classification problem and you use a classifier to reject outliers before they go to the main processing pipeline. I'm not sure if we want to support this use case; in the end you could always just add another class to the main problem that is "outlier".
I think @GaelVaroquaux and @glemaitre would have to weigh in with their thoughts on SLEP1 vs SLEP5. Right now I'm leaning towards SLEP1. Option A is basically saying "we don't need to address this issue", i.e. that the current API is sufficient. I think this is unlikely to be the solution of choice; it has been the default solution because we couldn't agree on any other so far.
Does that mean, @amueller, that you prefer its text/discussion, but perhaps not its proposals? I'm confused by your comment. I think the core of the current proposal(s) is that there is a new kind of estimator (resampler) that can be plugged into some kind of meta-estimator, which is either a Pipeline extended to handle such things, or a specialised meta-estimator class.
As I said in #15 (comment), I think there are two core issues: distinguishing the training and test phases, and returning more complex objects. Neither of these seems closely related to sampling, so I think the framing of this SLEP is confusing. It doesn't even mention these two core changes explicitly. If we don't call it …

While there might be a use case for …
I'm not really convinced it's the worst of both worlds. The semantics of including a resampler in a pipeline can get pretty confusing in some cases, certainly for the maintainer, but also for the user, I think. The specialised meta-estimator makes it much more explicit. The reason to have a specialised method as well is simply to have the meta-estimator pluggable, and to have the object plugged into it be a valid estimator (parametrised and state set by fitting).

I'm not sure how renaming fit_resample to fit_modify changes anything, unless you also mean that there was a change in semantics. However, unless I misunderstand your use of this for stacking, it would involve a transformation of the data at test time, which is different from the more minimal proposal here.
I think the semantics of having something in a pipeline that does something different during training and prediction are not that complicated, but the current scikit-learn API makes it really hard to reason about because it doesn't distinguish training-time and test-time transformations. I wasn't suggesting to change the semantics, but I don't think you need anything more general for stacking. The current proposal wants things not to be a resampler and a transformer at the same time, but I don't see why this would be necessary.

I can send along a paper I recently worked on that compares several pipelining languages. As I said in other places, for MLR this is no problem at all, and I think the semantics are perfectly clear, mostly because they differentiate between training-time and test-time transformations, and because they don't have …

If we were not mixing the two issues of returning targets and distinguishing training and test transformations, I would suggest we remove …

So re the minimal proposal, I think a pretty minimal proposal would be to add …
I'd be interested. I agree that these issues come from the design of fit_transform, fit returning self, transform returning only Xt, etc.
And at test time, any estimators with …
No, transform (or modify) can still do arbitrary things. For stacking it would not be the identity, it would be …
Yes, I agree, you still need to do all of these things. The thing is really awkward because the question is whether you'd also have a …

In an alternate universe, we could have …

Not sure if I'll get around to prototyping this, I still have a book to write this week, or something.
@orausch's implementation of Pipeline handling fit_resample, close to your proposal, is https://github.com/scikit-learn/scikit-learn/pull/13269/files/61fce479352f82b7f3d1136aac24d5598748ec73#diff-dd32b47cafa79252b520b030724ddda9. I think some things in that implementation could be simplified. There are some things not yet handled, such as resamplers returning kwargs/props.

A remaining question in your design is whether "modifiers" (please could we stick to "resamplers" for now and later propose the name change to modify?) must implement transform to be used at test time, e.g. whether the ResamplerMixin should implement an identity transform as a default?
Yes, I would say they should. I don't see why they wouldn't. Sure, we could special-case it and have the pipeline and other meta-estimators handle it, but I don't see why it would be better to put the logic there instead of in the estimator.

One thing that is a bit strange right now is the relation of a "resampler" to estimators that only have …

So a "resampler" to you is something that returns X and y and sample props, and has different behavior during training and prediction? I'm fine using that terminology as long as we agree what it means.
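A minimal sketch of the "identity transform by default" idea under discussion, assuming a hypothetical `ResamplerMixin`; `ToyUnderSampler` is a made-up example loosely inspired by imbalanced-learn's `RandomUnderSampler`, not an existing class:

```python
import numpy as np

class ResamplerMixin:
    """Hypothetical mixin: resamplers act on training data via fit_resample
    and fall back to an identity transform at test time."""

    def fit_resample(self, X, y):
        raise NotImplementedError

    def transform(self, X):
        return X  # identity: every incoming test sample is passed through

class ToyUnderSampler(ResamplerMixin):
    """Toy under-sampler: keeps an equal number of samples per class."""

    def __init__(self, random_state=0):
        self.random_state = random_state

    def fit_resample(self, X, y):
        rng = np.random.RandomState(self.random_state)
        classes, counts = np.unique(y, return_counts=True)
        n_keep = counts.min()
        keep = np.concatenate([
            rng.choice(np.flatnonzero(y == c), size=n_keep, replace=False)
            for c in classes
        ])
        return X[keep], y[keep]

X = np.arange(10, dtype=float).reshape(-1, 1)
y = np.array([0] * 7 + [1] * 3)
sampler = ToyUnderSampler()
X_res, y_res = sampler.fit_resample(X, y)  # 3 samples per class at training time
X_test = sampler.transform(X)              # identity: all 10 samples at test time
```

With this default, a resampler could sit inside a regular Pipeline and simply pass test data through unchanged.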
The downside of that proposal is that you can't use a …

If …
You can call …

I think it is appropriate not to use a Pipeline when you want predictions over something other than the input data. For TSNE and DBSCAN after resampling, you'd likely want to either keep a copy of the data in the resampled space (i.e. break the pipeline before TSNE/DBSCAN), or have a way to impute the TSNE/DBSCAN embeddings back onto the original full dataset, via something like …
Using the attribute works, though having to change the code if you add something with … I'm not sure if it would actually ever be an issue in practice, but conceptually it seems a bit ugly, and I find the mental model hard to describe and communicate.

I think something like @orausch's implementation probably solves most issues in some way. I guess I was hoping for something that results in a simple and easy-to-understand overall API, but that would probably require substantial backward-incompatible changes, so maybe that's more appropriate for sklearn 2.0 ;)

What part do you think could be simplified? The changes to the pipeline are pretty small, right?
Notes from the dev meeting: @jnothman felt like my concern was unrelated to the resampling and only relates to …

I felt like resampling is "returning y" + "doing something else on the training set". Can you maybe say how the alignment comes in?
A core assumption in …

One reason I like the verb "resample" here is that we are literally changing the set of samples that downstream estimators receive features of. (I'm intentionally distinguishing a "sample" from its features, to focus on the identity of the sample rather than the observation of it.) The fact that we need to be able to modify …

I hope this makes sense.
Also: as discussed at today's meeting, I proposed the meta-estimator …
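For illustration, a rough self-contained sketch of what such a meta-estimator could look like, borrowing the `ResampledTrainer` and `NaNRejector` names that appear later in this thread; this is a guess at the shape of the proposal, not the SLEP's reference implementation:

```python
import numpy as np
from sklearn.base import BaseEstimator, clone
from sklearn.linear_model import LogisticRegression

class NaNRejector:
    """Toy resampler: removes samples containing NaN."""
    def fit_resample(self, X, y):
        keep = ~np.isnan(X).any(axis=1)
        return X[keep], y[keep]

class ResampledTrainer(BaseEstimator):
    """Fits `predictor` on data resampled by `resampler`; predicts on raw data."""
    def __init__(self, resampler, predictor):
        self.resampler = resampler
        self.predictor = predictor

    def fit(self, X, y):
        # resampling happens only here, on the training data
        X_res, y_res = self.resampler.fit_resample(X, y)
        self.predictor_ = clone(self.predictor).fit(X_res, y_res)
        return self

    def predict(self, X):
        # no resampling at prediction time: one prediction per input sample
        return self.predictor_.predict(X)

X = np.array([[0.0], [np.nan], [1.0], [2.0]])
y = np.array([0, 0, 1, 1])
clf = ResampledTrainer(NaNRejector(), LogisticRegression()).fit(X, y)
print(clf.predict(np.array([[0.5], [1.5]])))
```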
An open question here is: is there any value in requiring resamplers to have …
I like your distinction above, and requiring resamplers not to change the meaning of features and basically to operate on sample identities. That makes sense as a distinct logical concept to me. However, in terms of API, I'm not sure if we're working on too narrow a solution here, since it will still not let us modify y in other areas or have distinct training and test transformations, two things that I think we are relatively certain we want. On the one hand it makes sense to me to tackle one problem after the other and have incremental solutions. However, these issues seem connected to me; as you said:

> …

So if we solve this particular case, but not the more general case, do we keep tacking on things to cover the other cases, and might we come up against issues that cannot be solved (like the ones I tried to construct above)? As you said, it might be fine not to use a pipeline in some cases, I guess.
I'm still not very aware of categories of use case for modifying y where it's not about:

…
If there are other cases, but they are not a cohesive category, I see no reason not to tackle them with custom meta-estimators. I like that this solution circumscribes a group of tools that do similar things, i.e. changing the set of samples used at training time, and it protects the user from doing something inappropriate with test samples. If we forbid resamplers having …

And I think this is separate from the need for non-equivalence between …
Yeah, I have tried to come up with other examples of modifying y that don't fit in these categories but couldn't really think of any. @GaelVaroquaux might have had some use cases in mind? And I agree, it's probably separate from having non-equivalence between …
I posted a Stack Overflow question today and @adrinjalali suggested that my concern is related to this SLEP. After a brief chat with him, I was convinced that I can contribute a use case where the Resampler could be helpful. Assume that we start with a dataset like so, denoted by …
etc. I would like to construct a pipeline which performs the following stages (each stage may consist of multiple steps): …
I hope this is indeed on-topic for the SLEP. I'll be happy to elaborate and discuss further as needed.
@glemaitre did you see my proposed amendments in glemaitre#4?
@glemaitre, should I just be pushing the changes from glemaitre#4 to this branch?

I see the proposal here as reasonable to implement, without touching much else in the library. The contrast between the Pipeline and the ResampledTrainer choice is still a bit irksome. Pipeline seems more natural (i.e. what users expect), but the scope of the resampling can be confusing, so I prefer the explicit ResampledTrainer. But once you've got a specialised meta-estimator like …
```python
    make_pipeline(StandardScaler(), SelectKBest(), SVC()),
)
clf = ResampledTrainer(
    NaNRejector(),  # removes samples containing NaN
```
I'm not sure how this is going to work; as in, what does the transform of this actually do?
Here is the SLEP regarding the Outlier Rejection API