I am wondering whether it would be worth caching the intermediate results in the decision tree estimators, to avoid re-computing the `sum_total` and `weighted_n_node_samples` attributes.
In particular, these `for` loops would no longer be needed:
scikit-learn/sklearn/tree/_criterion.pyx, lines 330 to 343 in 6ca9eab:

```cython
for p in range(start, end):
    i = samples[p]
    # w is originally set to be 1.0, meaning that if no sample weights
    # are given, the default weight of each sample is 1.0
    if sample_weight != NULL:
        w = sample_weight[i]
    # Count weighted class frequency for each target
    for k in range(self.n_outputs):
        c = <SIZE_t> self.y[i, k]
        sum_total[k * self.sum_stride + c] += w
    self.weighted_n_node_samples += w
```
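For context, the loop above computes weighted per-class counts per output. A pure-Python sketch of the same computation (a flattened `(n_outputs, n_classes)` array instead of the strided buffer, not the actual Cython implementation):

```python
import numpy as np

def node_class_stats(y, sample_weight, samples, start, end, n_outputs, n_classes):
    """Sketch of what the classification criterion's init loop computes:
    weighted class counts (sum_total) and weighted_n_node_samples."""
    sum_total = np.zeros((n_outputs, n_classes))
    weighted_n_node_samples = 0.0
    for p in range(start, end):
        i = samples[p]
        # default weight of each sample is 1.0 when no weights are given
        w = 1.0 if sample_weight is None else sample_weight[i]
        for k in range(n_outputs):
            c = int(y[i, k])
            sum_total[k, c] += w
        weighted_n_node_samples += w
    return sum_total, weighted_n_node_samples
```

These are exactly the quantities a parent node already holds for each child when it evaluates a split, which is what makes caching attractive.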
and:
scikit-learn/sklearn/tree/_criterion.pyx, lines 768 to 780 in 6ca9eab:

```cython
for p in range(start, end):
    i = samples[p]
    if sample_weight != NULL:
        w = sample_weight[i]
    for k in range(self.n_outputs):
        y_ik = self.y[i, k]
        w_y_ik = w * y_ik
        self.sum_total[k] += w_y_ik
        self.sq_sum_total += w_y_ik * y_ik
    self.weighted_n_node_samples += w
```
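The regression counterpart accumulates weighted sums and sums of squares. A pure-Python sketch (again only illustrative, not the Cython implementation):

```python
import numpy as np

def node_regression_stats(y, sample_weight, samples, start, end, n_outputs):
    """Sketch of what the MSE criterion's init loop computes: weighted sums
    (sum_total), weighted sum of squares (sq_sum_total) and
    weighted_n_node_samples."""
    sum_total = np.zeros(n_outputs)
    sq_sum_total = 0.0
    weighted_n_node_samples = 0.0
    for p in range(start, end):
        i = samples[p]
        # default weight of each sample is 1.0 when no weights are given
        w = 1.0 if sample_weight is None else sample_weight[i]
        for k in range(n_outputs):
            y_ik = y[i, k]
            w_y_ik = w * y_ik
            sum_total[k] += w_y_ik
            sq_sum_total += w_y_ik * y_ik
        weighted_n_node_samples += w
    return sum_total, sq_sum_total, weighted_n_node_samples
```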
and:
scikit-learn/sklearn/tree/_criterion.pyx, lines 1056 to 1068 in 6ca9eab:

```cython
for p in range(start, end):
    i = samples[p]
    if sample_weight != NULL:
        w = sample_weight[i]
    for k in range(self.n_outputs):
        # push method ends up calling safe_realloc, hence `except -1`
        # push all values to the right side,
        # since pos = start initially anyway
        (<WeightedMedianCalculator> right_child[k]).push(self.y[i, k], w)
    self.weighted_n_node_samples += w
```
It should be easy to cache these intermediate results in `Stack` and `PriorityHeap`, but I think we should discuss this before addressing the issue.
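To make the proposal concrete, here is a minimal Python sketch of the idea (with hypothetical names; the real `StackRecord` struct in `sklearn/tree/_utils.pyx` has different fields): the parent already holds the left/right statistics when it commits to a split, so it could push them with each child record instead of having `Criterion.init` redo the summation.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class StackRecord:
    """Hypothetical stack record carrying cached node statistics,
    so the criterion's init could skip the summation loops above."""
    start: int
    end: int
    depth: int
    # statistics computed by the parent while evaluating the split
    sum_total: np.ndarray = None
    weighted_n_node_samples: float = 0.0

def push_children(stack, rec, pos, sum_left, sum_right, w_left, w_right):
    """When splitting rec at pos, the parent already holds sum_left /
    sum_right, so both children are pushed with their stats precomputed."""
    stack.append(StackRecord(rec.start, pos, rec.depth + 1, sum_left, w_left))
    stack.append(StackRecord(pos, rec.end, rec.depth + 1, sum_right, w_right))
```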