[MRG] Adds Permutation Importance #13146
Conversation
Do we want to provide a meta-estimator giving `feature_importances_` for use in the places where that's expected?
Please also consider looking at eli5 for feature parity, and perhaps for testing ideas.
Hmmmm... By conducting cross validation over multiple splits, this determines feature importance for a class of model, rather than a specific model. If we are trying to inspect a specific model, surely we should not be fitting cv-many different models, but merely assessing the importance of features to prediction accuracy for the given model.
    for column in columns:
        with _permute_column(X_test, column, random_state) as X_perm:
            feature_score = scoring(estimator, X_perm, y_test)
            permutation_importance_scores.append(baseline_score - feature_score)
What does it mean when this value is negative? Do we need to clip in that case??
Negative means that the model performed better with the feature permuted. This could mean that the feature should be dropped.
There is a paragraph about this in https://explained.ai/rf-importance/index.html at Figure 3(a)
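For future readers, a minimal sketch of how a negative value can show up in practice, using the `sklearn.inspection.permutation_importance` API this PR converged on (the injected noise column, dataset sizes, and `n_repeats` value are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=300, n_features=5, n_informative=3, random_state=0)
# Append a pure-noise column: its true importance is zero, so its estimated
# permutation importance fluctuates around 0 and can come out slightly negative.
rng = np.random.RandomState(0)
X = np.hstack([X, rng.normal(size=(X.shape[0], 1))])

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(random_state=0).fit(X_train, y_train)

result = permutation_importance(model, X_test, y_test, n_repeats=30, random_state=0)
for i in np.argsort(result.importances_mean):
    print(f"feature {i}: {result.importances_mean[i]:.4f} "
          f"+/- {result.importances_std[i]:.4f}")
```

Features whose importance is negative or indistinguishable from zero are the natural candidates for dropping, as noted above.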
Interesting. I think both the docstring and the user guide should explain the meaning of negative importance.
This is correct. The CV mode isn't inspecting the model; it is using multiple models to find the importance of the features. It is "inspecting the data".
Hmmmm I had indeed thought of inspect as being about model inspection.
+1 for focusing first on a tool used for the single (fitted) model inspection use case. Here are alternative implementations:
Then we could think of a tool for automated feature selection using a nested cross-validation loop that can be used in a Pipeline.
Because it's so cheap to resample the individual predictions (on the permuted validation set), we should take advantage of this to recompute the mean score on many resampled predictions (bootstrap estimates of the importance). I think it's very important that the default behavior of this tool makes it natural to get bootstrap confidence intervals on the feature importances (e.g. a 2.5%-97.5% percentile interval in addition to the median importance across resampled importances).
Also, the feature importance plot in the example should use horizontal box-and-whisker plots to highlight the uncertainty of these feature importance estimates.
We could even lower the opacity of the boxplots for features whose 2.5%-97.5% range contains 0, to highlight that those features are not predictive (given the others).
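A sketch of the percentile-interval idea using the per-repeat importances that the merged function ended up exposing (note this resamples whole permutation repeats rather than the individual predictions suggested here, which is a simplification; the dataset and `n_repeats` are arbitrary):

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)

result = permutation_importance(clf, X_test, y_test, n_repeats=100, random_state=0)

# result.importances has shape (n_features, n_repeats): one column per shuffle.
lower, median, upper = np.percentile(result.importances, [2.5, 50, 97.5], axis=1)
maybe_not_predictive = (lower <= 0) & (0 <= upper)  # 0 inside the 2.5%-97.5% range

# Horizontal box plots, sorted by median importance, to show the uncertainty.
order = np.argsort(median)
plt.boxplot(result.importances[order].T, vert=False)
plt.xlabel("decrease in accuracy")
plt.show()
```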
Here are other interesting references that I have not carefully read yet:
@ogrisel Thank you for all the suggestions! I will focus this PR on inspecting a single fitted model and tune the API to make it easy to get bootstrap results.
Can I ask what your intention is in using the commit prefix RFC?
It’s a prefix I use to mean “REFACTOR”.
Oh! RFC means "request for comment" to me. Try CLN for clean?
    scores : array, shape (n_features, bootstrap_samples)
        Permutation importance scores
    """
Needs a reference - and a user guide!
Can you please check my and Guillaume's suggestions and address the remaining comments? I'd really like to merge this.
Permutation feature importance is a model inspection technique that can be used
for any `fitted` `estimator` when the data is rectangular. This is especially
useful for non-linear or opaque `estimators`. The permutation feature
importance is defined to be the decrease in a model score when the feature
Suggested change:
-importance is defined to be the decrease in a model score when the feature
+importance is defined to be the decrease in a model score when a single feature
useful for non-linear or opaque `estimators`. The permutation feature
importance is defined to be the decrease in a model score when the feature
value is randomly shuffled [1]_. This procedure breaks the relationship between
the feature and the target, thus the drop in the model score is analogous to
Suggested change:
-the feature and the target, thus the drop in the model score is analogous to
+the feature and the target, thus the drop in the model score is indicative of
always important to evaluate the predictive power of a model using a held-out
set (or better with cross-validation) prior to computing importances.
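For readers of this thread, a minimal usage sketch matching the paragraph above (the dataset and estimator are placeholders; importances are computed on held-out data as recommended):

```python
from sklearn.datasets import load_diabetes
from sklearn.inspection import permutation_importance
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = Ridge().fit(X_train, y_train)

result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
print(result.importances_mean)  # mean drop in score per feature
print(result.importances_std)   # spread over the repeated shuffles
```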
Relation to feature importance in trees |
Suggested change:
-Relation to feature importance in trees
+Relation to impurity-based importance in trees
---------------------------------------
Tree-based models provide a different measure of their own feature importances based
on the mean decrease in the splitting criterion. This gives importance to
Suggested change:
-on the mean decrease in the splitting criterion. This gives importance to
+on the mean decrease in impurity (MDI, impurity meaning the splitting criterion). This gives importance to
on the mean decrease in the splitting criterion. This gives importance to
features that may not be predictive on unseen data. The permutation feature
importance avoids this issue, since it can be applied to unseen data.
Furthermore, the tree importance computed based on the impurity decrease of |
Suggested change:
-Furthermore, the tree importance computed based on the impurity decrease of
+Furthermore, impurity-based feature importance for trees
(need some more rewrite in the next line)
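To make the contrast in this section concrete, a hedged sketch comparing the two measures on the same fitted forest (the dataset mirrors the example discussed later in this thread; impurity-based importances come from the training data, permutation importances from the held-out split):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, random_state=0)
rf = RandomForestClassifier(random_state=0).fit(X_train, y_train)

mdi = rf.feature_importances_  # mean decrease in impurity, computed on the training data
perm = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=0)

# A feature ranked highly by MDI but with near-zero permutation importance on the
# test set is a hint that the impurity-based ranking overstates its usefulness.
for name, a, b in zip(data.feature_names, mdi, perm.importances_mean):
    print(f"{name:25s} MDI={a:.3f}  permutation={b:.3f}")
```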
    feature_names = []
    for col, cats in zip(categorical_columns, ohe.categories_):
        for cat in cats:
            feature_names.append("{}_{}".format(col, cat))
Not addressed?
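If the comment is asking to avoid the manual loop, one possible alternative is to let the encoder generate the names (a sketch reusing `ohe` and `categorical_columns` from the snippet above; the exact method name depends on the scikit-learn version):

```python
# Older releases expose get_feature_names; current ones use get_feature_names_out.
feature_names = ohe.get_feature_names(categorical_columns)
# feature_names = ohe.get_feature_names_out(categorical_columns)  # newer scikit-learn
```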
In this example, we compute the permutation importance on the Wisconsin
breast cancer dataset using :func:`~sklearn.inspection.permutation_importance`.
The :class:`~sklearn.ensemble.RandomForestClassifier` can easily get about 97%
accuracy on a test dataset with an unsurprising tree impurity based feature
I don't get this sentence. What's unsurprising? Maybe just remove this part?
    plt.show()

    ##############################################################################
    # Next, we pick a threshold to group our features into clusters and choose a
Suggested change:
-# Next, we pick a threshold to group our features into clusters and choose a
+# Next, we manually pick a threshold by visual inspection of the dendrogram to group our features into clusters and choose a
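As a sketch of what picking a threshold looks like in code (the clustering recipe, distance metric, and threshold value here are illustrative assumptions, not necessarily what the example ends up using):

```python
from collections import defaultdict

import numpy as np
from scipy.cluster import hierarchy
from scipy.spatial.distance import squareform
from scipy.stats import spearmanr
from sklearn.datasets import load_breast_cancer

X, _ = load_breast_cancer(return_X_y=True)

# Hierarchical clustering on the Spearman rank correlation between features.
corr = spearmanr(X).correlation
dist = squareform(1 - np.abs(corr), checks=False)
linkage = hierarchy.ward(dist)

# Cut the dendrogram at a manually chosen distance threshold...
cluster_ids = hierarchy.fcluster(linkage, t=1, criterion="distance")

# ...and keep a single representative feature per cluster.
clusters = defaultdict(list)
for idx, cid in enumerate(cluster_ids):
    clusters[cid].append(idx)
selected = [members[0] for members in clusters.values()]
print(selected)
```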
    X /= X_std

    lr = LinearRegression().fit(X, y)
    expected_importances = 2 * lr.coef_**2
Please add a comment that this can be computed in closed form
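For reference, a sketch of the closed-form reasoning behind `expected_importances = 2 * lr.coef_**2` (assumptions: independent standardized features, `y` scaled to unit variance as suggested further down, and R² scoring; the scoring choice is a reading of the test setup, not something stated in this thread). Permuting column $j$ changes the linear prediction by $\beta_j(\tilde x_j - x_j)$; the cross term with the OLS residual vanishes, and the permuted copy is approximately uncorrelated with the original column, so:

$$
\Delta\mathrm{MSE}_j
  = \beta_j^2\,\mathbb{E}\big[(\tilde x_j - x_j)^2\big]
  \approx 2\,\beta_j^2\,\operatorname{Var}(x_j)
  = 2\,\beta_j^2,
\qquad
\Delta R^2_j = \frac{\Delta\mathrm{MSE}_j}{\operatorname{Var}(y)} = 2\,\beta_j^2 .
$$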
    def test_permutation_importance_linear_regresssion():
        X, y = make_regression(n_samples=500, n_features=10, random_state=0)

        y -= y.mean()
    X = scale(X)
    y = scale(y)
lgtm
my browser is working great for me these days..
I think there were only nitpicks after @ogrisel's approval, so merging.
Hooray! Great work guys! :) @jph00, check it out.
The things that happen while you're on the ski slopes. Congrats, @thomasjpfan!
Hi everyone, thanks for improving the usability of feature selection through ML. I have been trying to use `from sklearn.inspection import permutation_importance` but it throws an error: `ImportError: cannot import name 'permutation_importance'`.
It's not been released. Install the nightly build.
https://scikit-learn.org/stable/developers/advanced_installation.html#installing-nightly-builds
Thanks for your response @jnothman, I am planning to use it for a critical project. Is it safe to use the nightly build yet, and has it been tested for all the bugs? If not, I'll wait to use it for my next project.
It has been tested for all the bugs... that we encountered so far. After a major version release, users may find edge-case bugs that we couldn't catch.
This doesn't have a what's new entry!!
Added what's new in 9a6f05e
Feel free to tweak it.
This permutation importance is giving me only zeroes no matter how I choose the settings. Everything else works fine, including the default importance.
@kool7d please open an issue with code to reproduce the problem. It's likely that you have strongly correlated or uninformative features. Saying that the "default importances work fine" may just mean that they don't detect the issue.
Do the X, y arguments of this function take into account the transformations done within a pipeline, if a pipeline is passed as the estimator?
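For what it's worth, the usual pattern looks like the sketch below: the pipeline is scored end-to-end, so `X` is the raw, untransformed data and the reported importances refer to the original input columns (the dataset and pipeline steps here are arbitrary placeholders):

```python
from sklearn.datasets import load_diabetes
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The scaler is re-applied inside the pipeline every time a permuted copy of
# X_test is scored, so the permutation happens in the original feature space.
pipe = make_pipeline(StandardScaler(), SVR()).fit(X_train, y_train)
result = permutation_importance(pipe, X_test, y_test, n_repeats=10, random_state=0)
print(result.importances_mean)
```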
Are there plans for a drop-column importance implementation?
@jjakenichol Could you open an issue with the feature request? You will probably not get an answer by posting on a merged PR. Thanks.
Reference Issues/PRs
Resolves #11187

What does this implement/fix? Explain your changes.
Adds permutation importance to a `model_inspection` module.

TODO
- Compare with `feature_importances_` when using trees.