
[MRG] Adds Minimal Cost-Complexity Pruning to Decision Trees #12887

Merged (94 commits), Aug 20, 2019

Conversation

thomasjpfan (Member)

Reference Issues/PRs

Fixes #6557

What does this implement/fix? Explain your changes.

This PR implements Minimal Cost-Complexity Pruning based on L. Breiman, J. Friedman, R. Olshen, and C. Stone, "Classification and Regression Trees", Wadsworth, Belmont, CA, 1984.

Most of this implementation is the same as the literature. There are two differences:

  1. In Breiman et al., r(t) is an estimate of the probability of misclassification. This PR uses the impurity as r(t).
  2. The weighted number of samples is used to compute the probability of a sample landing on a node (see the sketch below).
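For concreteness, here is a minimal sketch of the per-node risk these two points describe, computed from the attributes a fitted tree exposes (impurity and weighted_n_node_samples). This is an illustration of the definitions, not the PR's actual pruning code.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(random_state=0).fit(X, y).tree_

# p(t): probability of a (weighted) sample landing on node t
p_node = tree.weighted_n_node_samples / tree.weighted_n_node_samples[0]

# r(t): the node impurity (instead of Breiman's misclassification estimate),
# so R(t) = p(t) * r(t) is node t's contribution to the total risk of the tree
r_node = p_node * tree.impurity
print(r_node[:5])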

A cost-complexity parameter, alpha, was added to __init__ to control cost-complexity pruning. The post-pruning is done at the end of fit.

The code performing Minimal Cost-Complexity Pruning is mostly done in Python. The Python part produces the node ids that will become the leaves of the new subtree. These leaves are passed to a Cython function called build_pruned_tree that builds a tree. This was written in Cython since the tree building API is in Cython.

In Cython, the Stack class is used to go through the tree. Not all of the fields of the StackRecord are used. This is a trade-off between the code complexity of adding yet another Stack class and being a little memory-inefficient.

Currently, prune_tree is public, which allows for the following use case:

clf = DecisionTreeClassifier(alpha=0.0)
clf.fit(X, y)
clf.set_params(alpha=0.1)
clf.prune_tree()

If we prefer, we can make prune_tree private and not encourage this use case.

@jnothman (Member) left a comment

Some API comments.

Some questions for the novice:

  • does calling prune_tree with the same alpha repeatedly return the same tree?
  • does calling prune_tree with increasing alpha return a strict sub-tree?

@@ -510,6 +515,110 @@ def decision_path(self, X, check_input=True):
X = self._validate_X_predict(X, check_input)
return self.tree_.decision_path(X)

def prune_tree(self):
Member:

I think this needs to be called after fit automatically to facilitate cross validation etc.

I wonder if this should instead be a public function?

Member Author:

Currently, prune_tree is a public function that is called at the end of fit. It should work with our cross validation classes/functions.

@thomasjpfan (Member Author)

does calling prune_tree with the same alpha repeatedly return the same tree?

As long as the original tree is the same, using the same alpha will return the same tree. I will add a test for this behavior.

does calling prune_tree with increasing alpha return a strict sub-tree?

When alpha gets high enough, the entire tree can be pruned, leaving just the root node.
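A quick way to see this with the parameter name that was eventually merged (ccp_alpha; the in-progress API in this thread calls it alpha) — a rough sketch, not a test from the PR:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
full = DecisionTreeClassifier(random_state=0).fit(X, y)
stump = DecisionTreeClassifier(random_state=0, ccp_alpha=10.0).fit(X, y)

print(full.tree_.node_count)   # the unpruned tree has many nodes
print(stump.tree_.node_count)  # 1: a large enough alpha prunes everything but the root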

###############################################################################
# Plot training and test scores vs alpha
# --------------------------------------
# Calcuate and plot the the training scores and test accuracy scores
Member:

Just a few typos (I'll document myself on tree pruning and will try to provide a more in-depth review later):

  • Calcuate
  • the the
  • above "a smaller trees"

also I think you should avoid the `:math:` notation in comments

Member Author:

The :math: notation is currently used in other examples such as: https://github.com/scikit-learn/scikit-learn/blob/master/examples/svm/plot_svm_scale_c.py. Are we discouraging the usage of :math: in our examples?

Member:

It's OK in the docstrings since it will be rendered like regular rst by sphinx, but in the comments it is not necessary.

Member:

Ooh ok I didn't know it worked like that, sorry

@NicolasHug (Member) left a comment

A first pass of cosmetic comments.

@thomasjpfan it seems to me that in general, and unless there's a compelling reason not to, sklearn code uses (potentially long) descriptive variable names.

For example par_idx could be renamed to parent_idx.
Same for cur_alpha, cur_idx, etc.


# bubble up values to ancestor nodes
for idx in leaf_idicies:
cur_R = r_node[idx]
Member:

Avoid upper-case in variable names (same for R_diff)

leaves_in_subtree = np.zeros(shape=n_nodes, dtype=np.uint8)

stack = [(0, -1)]
while len(stack) > 0:
Member:

while stack is more pythonic (same below)


stack = [(0, -1)]
while len(stack) > 0:
node_id, parent = stack.pop()
Member:

node_idx to stay consistent with the rest of the function. I would also suggest parent_idx instead of parent, and the parents array could just be named parent.

# computes number of leaves in all branches and the overall impurity of
# the branch. The overall impurity is the sum of r_node in its leaves.
n_leaves = np.zeros(shape=n_nodes, dtype=np.int32)
leaf_idicies, = np.where(leaves_in_subtree)
Member:

leaf_indices

r_branch[leaf_idicies] = r_node[leaf_idicies]

# bubble up values to ancestor nodes
for idx in leaf_idicies:
Member:

for leaf_idx in...?


# descendants of branch are not in subtree
stack = [cur_idx]
while len(stack) > 0:
Member:

while stack

inner_nodes[idx] = False
leaves_in_subtree[idx] = 0
in_subtree[idx] = False
n_left, n_right = child_l[idx], child_r[idx]
Member:

Usually n_something denotes a count. Here these are just indices, right?

leaves_in_subtree[cur_idx] = 1

# updates number of leaves
cur_leaves, n_leaves[cur_idx] = n_leaves[cur_idx], 0
@NicolasHug (Member), Jan 1, 2019:

I would propose

n_pruned_leaves = n_leaves[cur_idx] - 1
n_leaves[cur_idx] = 0

and accordingly update n_leaves[cur_idx] below


# bubble up values to ancestors
cur_idx = parents[cur_idx]
while cur_idx != _tree.TREE_LEAF:
@NicolasHug (Member), Jan 1, 2019:

It's a bit weird to bubble up to a leaf.

Whatever you're comparing to here should explicitly be the same value as what you used for defining the root's parent above (stack = [(0, -1)])

I would simply use while cur_idx != -1:

@adrinjalali (Member)

Some random thoughts:

  • In the context of ensembles and random forests, the parameter needs to be exposed there as well.
  • It'd be nice if we knew the overhead of the pruning. Especially once the user uses it in the context of random forests, the overhead is multiplied by the number of trees, and that times the number of fits in a grid search might be significant. Related to that, a few points come to mind:
    • having some numbers related to the overhead would be nice.
    • a potential warm_start for the pruning, maybe (since the rest of fit doesn't have to be run again for different alpha values).
    • moving the code to Cython might be an idea.
  • I'm not sure if it's necessary to create a copy of the tree for the pruned one. Probably having a masked version of the tree would be optimal for trying out multiple alpha values and a potential warm_start. That also depends on how much overhead that copying has.

@NicolasHug (Member)

@jnothman, to add to @thomasjpfan's answers:

does calling prune_tree with the same alpha repeatedly return the same tree?

The procedure is deterministic so calling prune_tree with same alpha and same original tree will give you the same pruned tree. Also as far as I understand, tree.prune_tree(alpha) == tree.prune_tree(alpha).prune_tree(alpha).

does calling prune_tree with increasing alpha return a strict sub-tree?

A subtree yes, but not necessarily a strict one:

With alpha_1 > alpha_2, tree.prune_tree(alpha_1) is a subtree of tree.prune_tree(alpha_2), but they may also be equal. This is because the alpha parameter is only used as a threshold here.
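This thresholding behavior is easy to demonstrate with the API as merged (cost_complexity_pruning_path and ccp_alpha); a rough sketch rather than code from this PR:

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

node_counts = [
    DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X, y).tree_.node_count
    for a in path.ccp_alphas
]
# node counts never increase with alpha, but two alphas falling between the same
# pair of effective alphas produce exactly the same (equal) subtree
print(np.all(np.diff(node_counts) <= 0))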

in_subtree = np.ones(shape=n_nodes, dtype=np.bool)

cur_alpha = 0
while cur_alpha < self.alpha:
Member:

2 thoughts:

  • on the resources that I read (here and here), the very first pruning step is to remove all the pure leaves (equivalent to using alpha=0 apparently). This is not done here since cur_alpha is immediately overwritten. I wonder if this is done by default in the tree growing algorithm.
  • As you check for cur_alpha < self.alpha and cur_alpha is computed before the tree is pruned in the loop, this means that the alpha of the returned pruned tree will be greater than self.alpha. It would seem more natural to me to return a tree whose alpha is less than self.alpha. In any case we would need to explain how alpha is used in the docs, something like "subtrees whose scores are less than alpha are discarded. The score is computed as ..."

@thomasjpfan (Member Author), Jan 7, 2019:

  • When alpha = 0, the number of leaves does not contribute to the cost-complexity measure, which I interpreted as "do not prune". Removing the leaves when alpha=0 will increase the cost-complexity measure.

  • Returning a tree whose alpha is less than self.alpha makes sense and should be documented.

Member:

Removing the leaves when alpha=0, will increase the cost-complexity measure.

It cannot increase the cost-complexity: the first step is to prune (some of) the pure leaves. That is, if a node N has 2 child leaves where all the samples in both leaves belong to the same class, then the first step will remove those 2 leaves and make N a leaf (which will still be pure). The process is repeated with N and its sibling if needed.

Member Author:

That is, if a node N has 2 child leaves where all the samples in both leaves belong to the same class, then the first step will remove those 2 leaves and make N a leaf

This makes sense. I will review the tree building code to see if this can happen.

To prevent future confusion, I want to get on the same page with our definition of a pure leaf. From my understanding, a pure leaf is a leaf whose samples belong to the same class, independent of all other leaves. From reading your response, you consider two leaves to be pure if they are siblings and their samples belong to the same class. Is this correct?

Member:

From my understanding, a pure leaf is a leaf whose samples belong to the same class, independent of all other leaves

I meant this as well

Member:

I just looked at the tree code; I think we can assume that this "first step" is not needed here after all, since a node is made a leaf according to the min_impurity_decrease (or deprecated min_impurity_split) parameter.

That is, if a node is pure according to min_impurity_decrease, it will be made a leaf, and thus the example case I mentioned above (a node with 2 pure leaves) cannot exist.

@jnothman (Member) commented Jan 2, 2019:

Where I was going with my questions was the idea of warm_start as well...

@thomasjpfan (Member Author)

@adamgreenhall @jnothman

  • Exposing this to the ensemble trees makes sense.
  • I will do some experimenting to benchmark the overhead of pruning, in its current form and a Cython version of it. I'll post the results here (a rough sketch of such a benchmark follows below).
  • The masked tree together with the warm_start parameter are great ideas. A masked tree would allow the level of pruning to be adjusted without growing the tree again, which looks really nice. The current copying approach allows the original tree to be deleted, so the pruned tree takes up less space.
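A rough sketch of what such an overhead benchmark could look like, using the parameter name that was eventually merged (ccp_alpha); the actual numbers will depend on the data and the machine:

from time import perf_counter

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=20000, n_features=20, random_state=0)

for alpha in [0.0, 1e-3]:  # 0.0 disables pruning
    start = perf_counter()
    DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X, y)
    print(f"ccp_alpha={alpha}: {perf_counter() - start:.3f}s")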

@jnothman (Member) commented Jan 8, 2019 via email.

@NicolasHug (Member) left a comment

Thanks Thomas, last minor comments but LGTM!

:math:`t`, and its branch, :math:`T_t`, can be equal depending on
:math:`\alpha`. We define the effective :math:`\alpha` of a node to be the
value where they are equal, :math:`R_\alpha(T_t)=R_\alpha(t)` or
:math:`\alpha_{eff}(t)=(R(t)-R(T_t))/(|\tilde{T}|-1)`. A non-terminal node
Member:

Suggested change
:math:`\alpha_{eff}(t)=(R(t)-R(T_t))/(|\tilde{T}|-1)`. A non-terminal node
:math:`\alpha_{eff}(t)=(R(t)-R(T_t))/(|T|-1)`. A non-terminal node

removed tilde

Minimal Cost-Complexity Pruning
===============================

Minimal cost-complexity pruning is an algorithm used to prune a tree after it
Member:

Please add ref here to L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees (Chapter 3)

ax.set_xlabel("alpha")
ax.set_ylabel("accuracy")
ax.set_title("Accuracy vs alpha for training and testing sets")
ax.plot(ccp_alphas, train_scores, label="train", drawstyle="steps-post")
Member:

Suggested change
ax.plot(ccp_alphas, train_scores, label="train", drawstyle="steps-post")
ax.plot(ccp_alphas, train_scores, marker='o', label="train", drawstyle="steps-post")

Member:

same below
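For context, a self-contained sketch of the kind of plot under review (accuracy vs. alpha built from cost_complexity_pruning_path). Variable names mirror the example, but this is an approximation, not the example file itself:

import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)
ccp_alphas = path.ccp_alphas

clfs = [DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_train, y_train)
        for a in ccp_alphas]
train_scores = [clf.score(X_train, y_train) for clf in clfs]
test_scores = [clf.score(X_test, y_test) for clf in clfs]

fig, ax = plt.subplots()
ax.set_xlabel("alpha")
ax.set_ylabel("accuracy")
ax.set_title("Accuracy vs alpha for training and testing sets")
ax.plot(ccp_alphas, train_scores, marker='o', label="train", drawstyle="steps-post")
ax.plot(ccp_alphas, test_scores, marker='o', label="test", drawstyle="steps-post")
ax.legend()
plt.show()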

Tree orig_tree,
DOUBLE_t ccp_alpha):
"""Build a pruned tree from the original tree by transforming the nodes in
leaves_in_subtree into leaves.
Member:

Remove this one??


cdef:
UINT32_t total_items = path_finder.count
np.ndarray ccp_alphas = np.empty(shape=total_items,
Member:

ping but not super important I think

@jnothman (Member) left a comment

Incidental comments

where :math:`|\tilde{T}|` is the number of terminal nodes in :math:`T` and
:math:`R(T)` is traditionally defined as the total misclassification rate of
the terminal nodes. Alternatively, scikit-learn uses the total sample weighted
impurity of the terminal nodes for :math:`R(T)`. As shown in the previous
Member:

best to say "above" rather than "in the previous section" or link to it so it can withstand change.

:math:`t`, and its branch, :math:`T_t`, can be equal depending on
:math:`\alpha`. We define the effective :math:`\alpha` of a node to be the
value where they are equal, :math:`R_\alpha(T_t)=R_\alpha(t)` or
:math:`\alpha_{eff}=(R(t)-R(T_t))/(|\tilde{T}|-1)`. A non-terminal node with
Member:

use \frac since this is not easily readable anyway.
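One possible \frac form of the expression quoted above (a sketch of the suggested change, keeping the tilde notation from the passage):

\alpha_{eff}(t) = \frac{R(t) - R(T_t)}{|\tilde{T}| - 1}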

===============================

Minimal cost-complexity pruning is an algorithm used to prune a tree, described
in Chapter 3 of [BRE]_. This algorithm is parameterized by :math:`\alpha\ge0`
Member:

It might be worth adding a small note to say that this is one method used to avoid over-fitting in trees.

:mod:`sklearn.tree`
...................

- |Feature| Adds minimal cost complexity pruning to
Member:

might be worth mentioning what the public api is... i.e. is it just ccp_alpha?

@NicolasHug NicolasHug merged commit 67c94c7 into scikit-learn:master Aug 20, 2019
@NicolasHug (Member)

Failure is unrelated and Joel's comments were addressed so I guess it's good to merge 🎉

Thanks @thomasjpfan !

@jnothman (Member) commented Aug 20, 2019 via email.

sebp added commits to sebp/scikit-survival that referenced this pull request (Apr 9-10, 2020):
- Deprecate presort (scikit-learn/scikit-learn#14907)
- Add Minimal Cost-Complexity Pruning to Decision Trees (scikit-learn/scikit-learn#12887)
- Add bootstrap sample size limit to forest ensembles (scikit-learn/scikit-learn#14682)
- Fix deprecated imports (scikit-learn/scikit-learn#9250)

Do not add ccp_alpha to SurvivalTree, because it relies on node_impurity, which is not set for SurvivalTree.
@TrigonaMinima commented May 14, 2020:

I have a question: in the literature [1], the authors first grow the maximal tree and then prune it according to different alphas. Following that, they either use a test set or cross-validation to find the best alpha or the corresponding "best pruned tree". Here, we select the tree before alpha crosses the ccp_alpha. Am I right, or did I miss something? Is the activity of selecting the "best" ccp_alpha left to the user?

1: L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth, Belmont, CA, 1984.

@thomasjpfan (Member Author)

Am I right or did I miss something? Is the activity of selecting the "best" ccp_alpha left to the user?

Yes, this needs to be done with our cross-validation tools.

There is a more interesting way to do this by setting aside some of the training data for validation, in such a way that the tree can automatically find an alpha. This has not been implemented here.
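For reference, a minimal sketch of selecting ccp_alpha with the cross-validation tools mentioned here (one possible approach, not something implemented in this PR):

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# candidate alphas from the pruning path (clipped to guard against tiny
# negative values caused by floating-point round-off)
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
alphas = np.clip(path.ccp_alphas, 0, None)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"ccp_alpha": alphas},
    cv=5,
)
search.fit(X, y)
print(search.best_params_)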

@LEEPEIQIN
Thank you for your great work; it benefits me a lot.
However, I am curious why you use Gini impurity instead of the misclassification rate as the cost function. Could you give a reference or any keywords I can search on Google?
Thank you very much!

@LEEPEIQIN
I found some further discussion in "performance learning" (Johannes Fürnkranz, Eyke Hüllermeier), pp. 87-88.
Thank you very much.

@thomasjpfan (Member Author)

Using the criterion allows pruning to be extended to regression trees (the criterion for classification defaults to Gini impurity).
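A small sketch of what this enables: ccp_alpha works unchanged for a regression tree, where a misclassification rate would not apply (assumes the merged ccp_alpha parameter; the regression criterion is squared error by default):

from sklearn.datasets import load_diabetes
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)
full = DecisionTreeRegressor(random_state=0).fit(X, y)
pruned = DecisionTreeRegressor(random_state=0, ccp_alpha=0.5).fit(X, y)

# pruning uses the regression criterion as r(t), so the same machinery applies
print(full.tree_.node_count, pruned.tree_.node_count)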

@LEEPEIQIN
Thank you very much.

Successfully merging this pull request may close these issues: add post-pruning for decision trees.

8 participants