
Unbiased mean decrease in impurity in tree-based methods #20059

@glemaitre

This issue is a follow-up to PR #20058.

Background

We are aware that our current implementation of mean decrease in impurity (MDI) is biased:

  • it uses statistics computed on the training set (an issue for models that overfit);
  • it favours continuous features and high-cardinality categorical features.

Current solution

To overcome these issues, we proposed the following:

  • implementation of feature permutation importance. It tends to be computationally expensive but has the advantage of being model agnostic (see the sketch after this list).
  • implementation of feature permutation importance leveraging the out-of-bag (OOB) samples available in random forests: ENH: OOB Permutation Importance for Random Forests #18603. This method is limited to random forests (it requires out-of-bag samples).
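
For illustration, here is a minimal sketch of the model-agnostic permutation importance already available in scikit-learn, evaluated on a held-out set to avoid the training-set bias mentioned above (dataset and hyper-parameters are arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Shuffle one feature at a time and measure the drop in score;
# this works with any fitted estimator, not only tree-based ones.
result = permutation_importance(forest, X_test, y_test, n_repeats=10, random_state=0)
print(result.importances_mean)
```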

Public API

Currently, the biased feature importance is available via the fitted attribute feature_importances_ on the tree-based models (decision tree, random forest, gradient-boosting decision tree).
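
For reference, this is how the biased MDI is exposed today (a minimal sketch; the dataset is arbitrary):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(random_state=0).fit(X, y)

# MDI computed from training-set statistics at fit time, hence the bias.
print(forest.feature_importances_)
```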

#18603 would introduce a parameter to the random forests, feature_importances, taking values in {"permutation_oob", "impurity"}, to switch between the mean decrease in impurity and the out-of-bag feature permutation importance.
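
Hypothetically, usage would look like the following. Note that this API is only proposed in #18603 and is not part of any release; the parameter name and values come from the proposal above:

```python
from sklearn.ensemble import RandomForestClassifier

# Proposed in #18603, not merged: switch the meaning of
# feature_importances_ from the biased MDI ("impurity") to the
# out-of-bag permutation importance ("permutation_oob").
forest = RandomForestClassifier(
    bootstrap=True,  # out-of-bag samples require bootstrapping
    feature_importances="permutation_oob",
)
```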

Correcting the bias of mean decrease in impurity

@ZhengzeZhou proposed an alternative to the above implementations in #20058. In short, it leverages the out-of-bag samples to correct the bias of the mean decrease in impurity.
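
To make the general idea concrete, below is a rough sketch that recomputes each tree's Gini impurity decreases on its out-of-bag samples instead of the in-bag training statistics. This is not the estimator from #20058, just an illustration of the mechanism; it relies on the private helper _generate_unsampled_indices (which may change between scikit-learn versions), and assumes the default max_samples=None and classes encoded as 0..n_classes-1:

```python
import numpy as np
from sklearn.ensemble._forest import _generate_unsampled_indices


def oob_impurity_importance(forest, X, y, n_classes):
    """Sketch: recompute per-node Gini impurity decreases on OOB samples."""
    n_samples, n_features = X.shape
    importances = np.zeros(n_features)

    def gini(counts):
        total = counts.sum()
        if total == 0:
            return 0.0
        p = counts / total
        return 1.0 - np.dot(p, p)

    for est in forest.estimators_:
        # Training rows that were *not* drawn in this tree's bootstrap.
        oob = _generate_unsampled_indices(est.random_state, n_samples, n_samples)
        if len(oob) == 0:
            continue
        X_oob, y_oob = X[oob], y[oob]
        # Sparse (n_oob, n_nodes) matrix: which nodes each OOB row visits.
        indicator = est.decision_path(X_oob)
        t = est.tree_
        # Per-node class counts of the OOB samples.
        counts = np.zeros((t.node_count, n_classes))
        for c in range(n_classes):
            counts[:, c] = np.asarray(indicator[y_oob == c].sum(axis=0)).ravel()
        node_n = counts.sum(axis=1)
        node_imp = np.array([gini(row) for row in counts])
        for node in range(t.node_count):
            left, right = t.children_left[node], t.children_right[node]
            if left < 0 or node_n[node] == 0:  # leaf, or no OOB sample reached it
                continue
            # Weighted impurity decrease, as in the usual MDI formula,
            # but with impurities estimated from the OOB samples.
            decrease = (node_n[node] / len(oob)) * (
                node_imp[node]
                - node_n[left] / node_n[node] * node_imp[left]
                - node_n[right] / node_n[node] * node_imp[right]
            )
            importances[t.feature[node]] += decrease
    return importances / len(forest.estimators_)
```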

While reviewing the paper, I came across recent work tackling the same issue using out-of-bag samples:

@ZhengzeZhou I could not yet look at all the papers in depth, but you have probably come across them since they tackle the same issue as your research. Do you have any feedback regarding these proposals?

Also, all of these methods are quite recent, and none of the proposals fulfil our inclusion criterion yet. We might nevertheless consider an implementation, since the bias is a real problem. However, it should be discussed in this issue before starting any pull request.

API discussion

Unbiasing the mean decrease in impurity requires out-of-bag samples. Currently, only random forests can benefit from this feature.

While working on #18603, adding a parameter to select the feature importance seemed to be the best option. In the future, we could think about deprecating the MDI computation on the training set.

However, we still have an issue when out-of-bag samples are not available. I am unsure about the best option for the decision tree and gradient-boosting decision tree classes.
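
In the meantime, one pragmatic fallback when no out-of-bag samples exist is to evaluate the permutation importance on an explicit held-out set (a sketch, not a proposal for the API; the split plays the role that OOB samples play in a forest):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

gbdt = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

# Importances measured on data the model never saw during fitting.
result = permutation_importance(gbdt, X_val, y_val, n_repeats=10, random_state=0)
print(result.importances_mean)
```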
