Description
This issue is a follow-up of PR #20058.
Background
We are aware that our current implementation of mean decrease in impurity (MDI) is biased:
- it uses statistics from the training set (an issue for models that overfit)
- it favours continuous features and high-cardinality categorical features (a minimal reproduction is sketched after this list)
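For illustration, the cardinality bias is easy to reproduce. In the minimal sketch below (made-up data, nothing from this issue), a purely random high-cardinality column receives a noticeably larger training-set MDI than a purely random binary column:

```python
# Minimal reproduction of the MDI cardinality bias (illustrative only).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X, y = make_classification(n_samples=1000, n_features=5, random_state=0)

# Append two uninformative columns: a binary one and a high-cardinality one.
X = np.hstack([
    X,
    rng.randint(0, 2, size=(1000, 1)),    # low-cardinality noise
    rng.randint(0, 100, size=(1000, 1)),  # high-cardinality noise
])

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# MDI is computed on the training set: the high-cardinality noise column
# typically gets a larger importance than the binary noise column, even
# though neither carries any signal.
print(forest.feature_importances_[-2:])
```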
Current solution
To overcome this issue, we proposed the following:
- an implementation of feature permutation importance (see the usage sketch after this list). It tends to be computationally expensive but has the advantage of being model agnostic.
- an implementation of feature permutation importance leveraging the out-of-bag samples available in random forests: ENH: OOB Permutation Importance for Random Forests #18603. This method is limited to random forests (it requires out-of-bag samples).
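For reference, the model-agnostic permutation importance is already exposed as `sklearn.inspection.permutation_importance` and is typically evaluated on held-out data. A minimal usage sketch on made-up data:

```python
# Model-agnostic permutation importance evaluated on a held-out set.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

# Each feature is shuffled n_repeats times; the importance is the mean
# drop in the test score caused by the shuffling.
result = permutation_importance(forest, X_test, y_test, n_repeats=10, random_state=0)
print(result.importances_mean)
```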
Public API
Currently, the biased feature importance is available via the fitted attribute feature_importances_ in the tree-based models (decision tree, random forest, gradient-boosting decision tree).
#18603 would introduce a parameter on the random forests, feature_importances, taking values in {"permutation_oob", "impurity"}, to switch between mean decrease in impurity and the out-of-bag feature permutation importance.
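To make the API shape explicit, the proposal in #18603 would roughly read as below. This is a hypothetical sketch of a parameter that is not merged; it will not run against current releases, and the final name and values may differ:

```python
# Hypothetical API from #18603 (not merged; names and values may change).
from sklearn.ensemble import RandomForestClassifier

# Current behaviour: impurity-based importance computed on the training set.
forest_mdi = RandomForestClassifier(feature_importances="impurity")

# Proposed alternative: permutation importance computed on the out-of-bag samples.
forest_oob = RandomForestClassifier(feature_importances="permutation_oob")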
Correcting the bias of mean decrease in impurity
@ZhengzeZhou proposed an alternative to the above implementation in #20058. In short, it leverages the out-of-bag samples to correct the bias of the mean decrease in impurity.
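For intuition only, the core idea (re-evaluating each split's impurity decrease on out-of-bag samples rather than on the in-bag samples that grew the tree) can be sketched as follows. This is a rough illustration with a manual bootstrap and made-up data, not the estimator from #20058 or the papers below:

```python
# Rough sketch: impurity decreases re-evaluated on out-of-bag (OOB) samples.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier


def gini(y):
    """Gini impurity of a label vector."""
    if len(y) == 0:
        return 0.0
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)


def oob_impurity_importances(tree, X_oob, y_oob, n_features):
    """Per-feature impurity decreases of `tree`, evaluated on OOB samples."""
    t = tree.tree_
    # Boolean matrix: entry (i, j) is True if OOB sample i passes through node j.
    node_indicator = tree.decision_path(X_oob).toarray().astype(bool)
    importances = np.zeros(n_features)
    for node in range(t.node_count):
        left, right = t.children_left[node], t.children_right[node]
        if left == -1:  # leaf node
            continue
        y_node = y_oob[node_indicator[:, node]]
        n = len(y_node)
        if n == 0:
            continue
        y_left = y_oob[node_indicator[:, left]]
        y_right = y_oob[node_indicator[:, right]]
        decrease = gini(y_node) - (
            len(y_left) / n * gini(y_left) + len(y_right) / n * gini(y_right)
        )
        # Attribute the (possibly negative) decrease to the split feature,
        # weighted by the fraction of OOB samples reaching the node.
        importances[t.feature[node]] += (n / len(y_oob)) * decrease
    return importances


rng = np.random.RandomState(0)
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
n_samples, n_features = X.shape

n_trees = 50
importances = np.zeros(n_features)
for _ in range(n_trees):
    # Manual bootstrap so that the OOB indices are known explicitly.
    in_bag = rng.randint(0, n_samples, n_samples)
    oob = np.setdiff1d(np.arange(n_samples), in_bag)
    est = DecisionTreeClassifier(max_features="sqrt", random_state=rng)
    est.fit(X[in_bag], y[in_bag])
    importances += oob_impurity_importances(est, X[oob], y[oob], n_features)

# Noise features now tend to get importances close to zero (or negative),
# unlike with the training-set MDI.
print(importances / n_trees)
```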
While reviewing the paper, I came across recent work tackling the same issue using the out-of-bag samples:
- https://arxiv.org/pdf/1903.05179.pdf implemented in add unbiased_feature_importance_ for RandomForest #20058
- https://arxiv.org/pdf/1906.10845.pdf
- https://arxiv.org/pdf/2003.02106.pdf
@ZhengzeZhou I have not yet looked at all the papers in depth, but you probably came across them since they tackle the same issue as your research. Do you have any feedback regarding these proposals?
Also, all of these methods are quite recent and none of the proposals fulfil our inclusion criterion yet; we might nevertheless consider an implementation since the bias is genuinely problematic. However, it should be discussed in this issue before starting any pull request.
API discussion
Unbiasing the mean decrease in impurity requires out-of-bag samples. Currently, only random forests can benefit from this feature.
While working on #18603, adding a parameter to select the feature importance seemed to be the best option. In the future, we could think about deprecating the MDI computation on the training set.
However, we still have an issue when no out-of-bag samples are at hand. I am unsure what the best options would be for the decision tree and gradient-boosting decision tree classes.