Describe the workflow you want to enable
I would like to pass categorical predictors to scikit-learn's RandomForestRegressor/RandomForestClassifier without having to dummy-code them.
The potential improvement in predictive power is significant, and this idea has already been implemented in most competing random forest packages in R (e.g. ranger or randomForest) as well as in the H2O library.
Describe your proposed solution
Following the documentation in ranger, I propose to add the parameter respect.unordered.factors:
Unordered categorical predictors should be handled in 3 different ways by using respect.unordered.factors: For 'ignore' all factors are regarded as ordered; for 'partition' all possible 2-partitions are considered for splitting. For 'order' and 2-class classification the factor levels are ordered by their proportion falling in the second class, for regression by their mean response, as described in Hastie et al. (2009), chapter 9.2.4. The use of 'order' is recommended, as it is computationally fast and can handle an unlimited number of factor levels. Note that the factors are only reordered once and not again in each split.
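To make the 'order' option concrete, here is a minimal sketch in plain Python of how factor levels could be ranked by mean response before fitting. The function name order_levels_by_mean_response and the toy data are hypothetical, chosen only for illustration; this is not ranger's or scikit-learn's implementation. With a 0/1 target the same function gives the ordering by proportion falling in the second class.

```python
from collections import defaultdict


def order_levels_by_mean_response(categories, y):
    """Map each category level to an integer rank, ordered by mean response.

    Hypothetical helper for illustration only. The ordering is computed
    once, up front, and is not recomputed at each split.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for level, target in zip(categories, y):
        sums[level] += target
        counts[level] += 1
    means = {level: sums[level] / counts[level] for level in sums}
    # Sort by mean response; ties are broken by the level's string form
    # so the result is deterministic.
    ordered = sorted(means, key=lambda level: (means[level], str(level)))
    return {level: rank for rank, level in enumerate(ordered)}


# Toy data (made up for this example).
colors = ["red", "blue", "green", "blue", "red", "green"]
y = [10.0, 1.0, 5.0, 3.0, 12.0, 7.0]

mapping = order_levels_by_mean_response(colors, y)
encoded = [mapping[c] for c in colors]
```

The resulting integer column can then be treated by the tree like any ordered numeric feature, so a single threshold on the codes corresponds to a split of the levels into two groups.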
Describe alternatives you've considered, if relevant
No response
Additional context
The combinatorial search can be avoided in the case of binary classification or regression as shown by Breiman in his original work.
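As a small numerical check of this claim for the regression case, the sketch below compares an exhaustive search over all 2-partitions of the levels with a scan over levels sorted by mean response, using the sum of squared errors as the split criterion. All function names and the toy data are assumptions made for illustration; it says nothing about how any particular library implements the search.

```python
from itertools import combinations


def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def split_cost(categories, y, left_levels):
    """Total SSE after sending `left_levels` left and all other levels right."""
    left = [t for c, t in zip(categories, y) if c in left_levels]
    right = [t for c, t in zip(categories, y) if c not in left_levels]
    return sse(left) + sse(right)


def best_split_exhaustive(categories, y):
    """Try every non-empty proper subset of levels as the left child.

    Each partition is visited twice (once per side), which is acceptable
    for a sketch; the number of subsets grows exponentially with the
    number of levels.
    """
    levels = sorted(set(categories))
    best_cost, best_left = None, None
    for size in range(1, len(levels)):
        for left in combinations(levels, size):
            cost = split_cost(categories, y, set(left))
            if best_cost is None or cost < best_cost:
                best_cost, best_left = cost, set(left)
    return best_cost, best_left


def best_split_ordered(categories, y):
    """Sort levels by mean response, then try only the L - 1 prefix splits."""
    levels = sorted(set(categories))
    means = {}
    for level in levels:
        targets = [t for c, t in zip(categories, y) if c == level]
        means[level] = sum(targets) / len(targets)
    ordered = sorted(levels, key=lambda level: means[level])
    best_cost, best_left = None, None
    for k in range(1, len(ordered)):
        left = set(ordered[:k])
        cost = split_cost(categories, y, left)
        if best_cost is None or cost < best_cost:
            best_cost, best_left = cost, left
    return best_cost, best_left


# Toy data (made up for this example): four levels, two samples each.
categories = list("aabbccdd")
y = [1.0, 2.0, 9.0, 10.0, 4.0, 5.0, 20.0, 22.0]

cost_exhaustive, _ = best_split_exhaustive(categories, y)
cost_ordered, _ = best_split_ordered(categories, y)
```

On this toy data both searches reach the same minimum cost, while the ordered scan evaluates only one candidate per adjacent pair of sorted levels.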
Thank you for opening this PR! The most recent work on categories in trees is at #12866. Currently, the tree code is hard to maintain, and I am working on a tree refactor/redesign to make it easier to add tree-based features, including categorical features.