
Unbiased mean decrease in impurity in tree-based methods #20059

@glemaitre

This issue is a follow-up to PR #20058.

Background

We are aware that our current implementation of mean decrease in impurity (MDI) is biased:

  • it uses statistics computed on the training set (an issue for models that overfit);
  • it favours continuous features and high-cardinality categorical features.

Current solution

To overcome these issues, we proposed the following:

  • implementation of feature permutation importance. It tends to be computationally expensive but has the advantage of being model agnostic (see the sketch after this list).
  • implementation of feature permutation importance leveraging the out-of-bag (OOB) samples available in random forests: ENH: OOB Permutation Importance for Random Forests #18603. This method is limited to random forests (it requires out-of-bag samples).
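
For illustration, here is a minimal sketch of the model-agnostic permutation importance already available in scikit-learn, evaluated on a held-out set to avoid the training-set bias mentioned above (dataset and hyper-parameters are arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Shuffle one feature at a time and measure the drop in score;
# this works with any fitted estimator, not only tree-based ones.
result = permutation_importance(forest, X_test, y_test, n_repeats=10, random_state=0)
print(result.importances_mean)
```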

Public API

Currently, the biased feature importance is available via the fitted attribute feature_importances_ on the tree-based models (decision tree, random forest, gradient-boosting decision tree).
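
For reference, this is how the biased MDI is exposed today (a minimal sketch; the dataset is arbitrary):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(random_state=0).fit(X, y)

# MDI computed from training-set statistics at fit time, hence the bias.
print(forest.feature_importances_)
```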

#18603 would introduce a parameter to the random forests, feature_importances, taking values in {"permutation_oob", "impurity"}, to switch between the mean decrease in impurity and the out-of-bag feature permutation importance.
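
Hypothetically, usage would look like the following. Note that this API is only proposed in #18603 and is not part of any release; the parameter name and values come from the proposal above:

```python
from sklearn.ensemble import RandomForestClassifier

# Proposed in #18603, not merged: switch the meaning of
# feature_importances_ from the biased MDI ("impurity") to the
# out-of-bag permutation importance ("permutation_oob").
forest = RandomForestClassifier(
    bootstrap=True,  # out-of-bag samples require bootstrapping
    feature_importances="permutation_oob",
)
```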

Correcting the bias of mean decrease in impurity

@ZhengzeZhou proposed an alternative to the above implementations in #20058. In short, it leverages the out-of-bag samples to correct the bias of the mean decrease in impurity.
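
To make the general idea concrete, below is a rough sketch that recomputes each tree's Gini impurity decreases on its out-of-bag samples instead of the in-bag training statistics. This is not the estimator from #20058, just an illustration of the mechanism; it relies on the private helper _generate_unsampled_indices (which may change between scikit-learn versions), and assumes the default max_samples=None and classes encoded as 0..n_classes-1:

```python
import numpy as np
from sklearn.ensemble._forest import _generate_unsampled_indices


def oob_impurity_importance(forest, X, y, n_classes):
    """Sketch: recompute per-node Gini impurity decreases on OOB samples."""
    n_samples, n_features = X.shape
    importances = np.zeros(n_features)

    def gini(counts):
        total = counts.sum()
        if total == 0:
            return 0.0
        p = counts / total
        return 1.0 - np.dot(p, p)

    for est in forest.estimators_:
        # Training rows that were *not* drawn in this tree's bootstrap.
        oob = _generate_unsampled_indices(est.random_state, n_samples, n_samples)
        if len(oob) == 0:
            continue
        X_oob, y_oob = X[oob], y[oob]
        # Sparse (n_oob, n_nodes) matrix: which nodes each OOB row visits.
        indicator = est.decision_path(X_oob)
        t = est.tree_
        # Per-node class counts of the OOB samples.
        counts = np.zeros((t.node_count, n_classes))
        for c in range(n_classes):
            counts[:, c] = np.asarray(indicator[y_oob == c].sum(axis=0)).ravel()
        node_n = counts.sum(axis=1)
        node_imp = np.array([gini(row) for row in counts])
        for node in range(t.node_count):
            left, right = t.children_left[node], t.children_right[node]
            if left < 0 or node_n[node] == 0:  # leaf, or no OOB sample reached it
                continue
            # Weighted impurity decrease, as in the usual MDI formula,
            # but with impurities estimated from the OOB samples.
            decrease = (node_n[node] / len(oob)) * (
                node_imp[node]
                - node_n[left] / node_n[node] * node_imp[left]
                - node_n[right] / node_n[node] * node_imp[right]
            )
            importances[t.feature[node]] += decrease
    return importances / len(forest.estimators_)
```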

While reviewing the paper, I came across recent work tackling the same issue using out-of-bag samples:

@ZhengzeZhou I could not yet look at all the papers in depth, but you have probably come across them since they tackle the same issue as your research. Do you have any feedback regarding these proposals?

Also, all of these methods are quite recent, and none of the proposals fulfil our inclusion criterion yet. We might nevertheless consider an implementation, since the bias is a real problem. However, it should be discussed in this issue before starting any pull request.

API discussion

Unbiasing the mean decrease in impurity requires out-of-bag samples. Currently, only random forests can benefit from this feature.

While working on #18603, adding a parameter to select the feature importance seemed to be the best option. In the future, we could think about deprecating the MDI computation on the training set.

However, we still have an issue when out-of-bag samples are not available. I am unsure about the best option for the decision tree and gradient-boosting decision tree classes.
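
In the meantime, one pragmatic fallback when no out-of-bag samples exist is to evaluate the permutation importance on an explicit held-out set (a sketch, not a proposal for the API; the split plays the role that OOB samples play in a forest):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

gbdt = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

# Importances measured on data the model never saw during fitting.
result = permutation_importance(gbdt, X_val, y_val, n_repeats=10, random_state=0)
print(result.importances_mean)
```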
