Description
Summary
#24678 introduces a modularization of Criterion to allow different criteria to be used with the same classes.
#25101 introduces a modularization of Splitter to allow different types of splits to be computed.
Now it is time to modularize the Tree class as well. A good Tree class should enable oblique splits, causal leaf nodes (i.e. leaf nodes set differently from split nodes), quantile trees (where leaf nodes are also set differently from split nodes) and unsupervised trees. Another feature of causal trees is 'honesty', which should be easier to add after this issue is resolved.
Proposed improvement
We will have the following improvements:
- Refactor `tree._add_node()` to set the split node and leaf node differently.
- Refactor to have a 'splitptr' for `SplitRecord`, which allows for generalizations of the `SplitRecord`.
- Separate `Tree` into generic and abstract base functions for `BaseTree` and specific supervised axis-aligned functions for `Tree`
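
To make the intent of these three points concrete, here is a minimal pure-Python sketch of the shape such a refactor could take. It is not the actual Cython implementation: the hook names `_set_split_node` and `_set_leaf_node`, the `ObliqueSplitRecord` and `ObliqueTree` classes, and the dict-based node storage are all hypothetical and chosen only for illustration.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class SplitRecord:
    """Simplified, hypothetical stand-in for the Cython SplitRecord struct."""
    feature: int
    threshold: float


@dataclass
class ObliqueSplitRecord(SplitRecord):
    """Hypothetical generalization: split on a weighted sum of features."""
    weights: dict = field(default_factory=dict)  # feature index -> weight


class BaseTree(ABC):
    """Generic node bookkeeping; what a node stores is left to subclasses."""

    def __init__(self):
        self.nodes = []

    def _add_node(self, is_leaf, split=None, value=None):
        # Split nodes and leaf nodes are filled in by separate hooks.
        node = {"is_leaf": is_leaf}
        if is_leaf:
            self._set_leaf_node(node, value)
        else:
            self._set_split_node(node, split)
        self.nodes.append(node)
        return len(self.nodes) - 1  # id of the node just added

    @abstractmethod
    def _set_split_node(self, node, split):
        """Store split information on an internal node."""

    @abstractmethod
    def _set_leaf_node(self, node, value):
        """Store prediction information on a leaf node."""


class Tree(BaseTree):
    """Supervised, axis-aligned behaviour."""

    def _set_split_node(self, node, split):
        node["feature"] = split.feature
        node["threshold"] = split.threshold

    def _set_leaf_node(self, node, value):
        node["value"] = value


class ObliqueTree(Tree):
    """Hypothetical subclass that only overrides how split nodes are set."""

    def _set_split_node(self, node, split):
        node["weights"] = dict(split.weights)
        node["threshold"] = split.threshold


tree = Tree()
root_id = tree._add_node(is_leaf=False, split=SplitRecord(feature=2, threshold=0.5))
leaf_id = tree._add_node(is_leaf=True, value=1.0)

oblique = ObliqueTree()
oblique_split = ObliqueSplitRecord(feature=-1, threshold=0.0, weights={0: 0.7, 3: -0.3})
oblique_root_id = oblique._add_node(is_leaf=False, split=oblique_split)
```

In this sketch a third-party subclass changes only the hook it cares about (`ObliqueTree` reuses the leaf logic of `Tree`), which is the kind of extension the proposal aims to make possible.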
Once the changes are made, one should verify:
- that the `tree` submodule's Cython code still builds (i.e. `make clean` and then `pip install --verbose --no-build-isolation --editable .` should not error out)
- that the unit tests inside `sklearn/tree` all pass
- that the asv benchmarks do not show a performance regression: run `asv continuous --verbose --split --bench RandomForest upstream/main <new_branch_name>` and then, for a side-by-side comparison, `asv compare main <new_branch_name>`
Reference
As discussed in #24577, I wrote up a doc on proposed improvements to the tree submodule that would:
- make it easier for 3rd party packages to subclass existing sklearn tree code and
- make it easier for sklearn itself to make improvements to the tree code with many of the modern improvements to trees
cc: @jjerphan