categorical predictors in randomForest #21932

Open
markusloecher opened this issue Dec 9, 2021 · 2 comments

Comments

@markusloecher

Describe the workflow you want to enable

I would like to pass categorical predictors to sklearn's RandomForestRegressor/RandomForestClassifier without having to dummy-code them.
The potential improvement in predictive power is significant, and this idea has been implemented in most competing random forest packages in R (e.g. ranger or randomForest) as well as in the H2O library.

Describe your proposed solution

Following the documentation in ranger, I propose to add the parameter respect.unordered.factors:

Unordered categorical predictors can be handled in 3 different ways by using respect.unordered.factors: For 'ignore' all factors are regarded as ordered, for 'partition' all possible 2-partitions are considered for splitting. For 'order' and 2-class classification the factor levels are ordered by their proportion falling in the second class, for regression by their mean response, as described in Hastie et al. (2009), chapter 9.2.4. The use of 'order' is recommended, as it is computationally fast and can handle an unlimited number of factor levels. Note that the factors are only reordered once and not again in each split.
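To make the 'order' option concrete, here is a minimal pure-Python sketch of the reordering step. The function names `order_levels_by_mean` and `encode` and the toy data are hypothetical and are not part of ranger or scikit-learn. With 0/1 labels the mean response equals the proportion falling in the second class, so the same function covers both the regression and the 2-class case.

```python
from collections import defaultdict


def order_levels_by_mean(categories, response):
    """Map each level to an integer rank, ordered by ascending mean response.

    For 0/1 class labels the mean is the proportion of samples in class 1.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for level, y in zip(categories, response):
        sums[level] += y
        counts[level] += 1
    means = {level: sums[level] / counts[level] for level in sums}
    # Ties in the mean are broken by the level name so the result is deterministic.
    ordered = sorted(means, key=lambda level: (means[level], level))
    return {level: rank for rank, level in enumerate(ordered)}


def encode(categories, mapping):
    """Replace each level by its rank; the result can be used as a numeric column."""
    return [mapping[level] for level in categories]


# Toy regression data (made up for illustration).
colors = ["red", "blue", "green", "blue", "red", "green"]
y = [10.0, 1.0, 5.0, 3.0, 12.0, 6.0]

mapping = order_levels_by_mean(colors, y)  # computed once, before any tree is grown
codes = encode(colors, mapping)
print(mapping)
print(codes)
```

The integer codes could then be passed to a forest as an ordinary ordered feature. As the ranger text notes, the mapping is computed once up front rather than recomputed at every split.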

Describe alternatives you've considered, if relevant

No response

Additional context

The combinatorial search can be avoided in the case of binary classification or regression as shown by Breiman in his original work.
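As a small sanity check of this claim, the sketch below compares an exhaustive search over all 2-partitions of the levels with a search over only the cut points of the mean-ordered levels, using squared error as the split criterion. The helper names and the data are made up for illustration; this is not how any particular library implements it.

```python
from itertools import combinations


def sse(values):
    """Sum of squared deviations from the mean (0 for an empty group)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def split_cost(categories, response, left_levels):
    """Total squared error when samples with a level in left_levels go left."""
    left = [y for c, y in zip(categories, response) if c in left_levels]
    right = [y for c, y in zip(categories, response) if c not in left_levels]
    return sse(left) + sse(right)


def best_split_exhaustive(categories, response):
    """Try every non-trivial subset of levels; returns (best cost, subsets tried)."""
    levels = sorted(set(categories))
    costs = [
        split_cost(categories, response, set(left))
        for size in range(1, len(levels))
        for left in combinations(levels, size)
    ]
    # Each partition appears twice here (once per side), which is fine for a check.
    return min(costs), len(costs)


def best_split_ordered(categories, response):
    """Sort levels by mean response and try only the cuts between neighbours."""
    levels = sorted(set(categories))
    means = {}
    for level in levels:
        ys = [y for c, y in zip(categories, response) if c == level]
        means[level] = sum(ys) / len(ys)
    ordered = sorted(levels, key=lambda level: (means[level], level))
    costs = [
        split_cost(categories, response, set(ordered[:cut]))
        for cut in range(1, len(ordered))
    ]
    return min(costs), len(costs)


cats = ["a", "b", "c", "d", "a", "b", "c", "d"]
y = [1.0, 9.0, 2.0, 8.0, 2.0, 10.0, 3.0, 7.0]

exhaustive_cost, exhaustive_tried = best_split_exhaustive(cats, y)
ordered_cost, ordered_tried = best_split_ordered(cats, y)
print(exhaustive_cost, exhaustive_tried)
print(ordered_cost, ordered_tried)
```

On this toy data both searches reach the same minimum cost, but the ordered search evaluates 3 candidate splits for 4 levels instead of 14. The gap grows exponentially with the number of levels.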

@thomasjpfan
Member

thomasjpfan commented Dec 9, 2021

Thank you for opening this issue! The most recent work on categories in trees is at #12866. Currently the tree code is hard to maintain, and I am working on a tree refactor/redesign to make it easier to add tree-based features, including categorical features.

@markusloecher
Author

Thanks a lot for this pointer! I had searched the issues only for randomForest but should have broadened the search to trees.
