PERF Cache intermediate results in decision tree estimators #18630

@alfaro96

Description

I am wondering whether it would be worth caching the intermediate results in the decision tree estimators, so that the `sum_total` and `weighted_n_node_samples` attributes do not have to be recomputed.

In particular, the following for loops would no longer be needed:

```cython
for p in range(start, end):
    i = samples[p]

    # w is originally set to be 1.0, meaning that if no sample weights
    # are given, the default weight of each sample is 1.0
    if sample_weight != NULL:
        w = sample_weight[i]

    # Count weighted class frequency for each target
    for k in range(self.n_outputs):
        c = <SIZE_t> self.y[i, k]
        sum_total[k * self.sum_stride + c] += w

    self.weighted_n_node_samples += w
```
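
In plain Python terms, the statistics this first loop produces — and which could instead be cached on the node and reused — amount to the following sketch (the function name and signature are hypothetical, not scikit-learn API):

```python
import numpy as np

def node_class_stats(y, sample_weight, samples, start, end, n_classes):
    """Weighted class frequencies and weighted sample count for one node.

    A plain-Python sketch of what ClassificationCriterion.init computes;
    illustrative only, not the Cython implementation.
    """
    n_outputs = y.shape[1]
    sum_total = np.zeros((n_outputs, n_classes))
    weighted_n_node_samples = 0.0
    for p in range(start, end):
        i = samples[p]
        # Default weight is 1.0 when no sample weights are given.
        w = 1.0 if sample_weight is None else sample_weight[i]
        for k in range(n_outputs):
            c = int(y[i, k])
            sum_total[k, c] += w
        weighted_n_node_samples += w
    return sum_total, weighted_n_node_samples
```

These two return values are exactly what would be stored alongside the node record, so that a child node could start from them instead of re-scanning its sample range.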

and:

```cython
for p in range(start, end):
    i = samples[p]

    if sample_weight != NULL:
        w = sample_weight[i]

    for k in range(self.n_outputs):
        y_ik = self.y[i, k]
        w_y_ik = w * y_ik
        self.sum_total[k] += w_y_ik
        self.sq_sum_total += w_y_ik * y_ik

    self.weighted_n_node_samples += w
```
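
The regression case is analogous, and the statistics are additive: the parent's split evaluation already maintains them as left/right partial sums, so a child could in principle inherit them without another pass over its samples. A hypothetical plain-Python equivalent of the loop:

```python
import numpy as np

def node_regression_stats(y, sample_weight, samples, start, end):
    """Weighted sums, squared sum, and weighted sample count for one node.

    A plain-Python sketch of the RegressionCriterion.init loop above;
    illustrative only, not the Cython implementation.
    """
    n_outputs = y.shape[1]
    sum_total = np.zeros(n_outputs)
    sq_sum_total = 0.0
    weighted_n_node_samples = 0.0
    for p in range(start, end):
        i = samples[p]
        w = 1.0 if sample_weight is None else sample_weight[i]
        for k in range(n_outputs):
            y_ik = y[i, k]
            w_y_ik = w * y_ik
            sum_total[k] += w_y_ik
            sq_sum_total += w_y_ik * y_ik
        weighted_n_node_samples += w
    return sum_total, sq_sum_total, weighted_n_node_samples
```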

and:

```cython
for p in range(start, end):
    i = samples[p]

    if sample_weight != NULL:
        w = sample_weight[i]

    for k in range(self.n_outputs):
        # push method ends up calling safe_realloc, hence `except -1`
        # push all values to the right side,
        # since pos = start initially anyway
        (<WeightedMedianCalculator> right_child[k]).push(self.y[i, k], w)

    self.weighted_n_node_samples += w
```
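
This MAE loop pushes the raw values into a `WeightedMedianCalculator` rather than accumulating sums, but `weighted_n_node_samples` is again recomputed from scratch. To make the loop concrete, here is a toy pure-Python stand-in for the calculator (illustrative only — the real Cython class uses a more efficient incremental structure, not a sort on every query):

```python
class WeightedMedianSketch:
    """Toy stand-in for sklearn's WeightedMedianCalculator: stores
    (value, weight) pairs and computes the weighted median on demand."""

    def __init__(self):
        self.items = []

    def push(self, value, weight):
        # Mirrors the push() call in the loop above.
        self.items.append((value, weight))

    def weighted_median(self):
        # Smallest value at which cumulative weight reaches half the total.
        items = sorted(self.items)
        total = sum(w for _, w in items)
        acc = 0.0
        for value, w in items:
            acc += w
            if acc >= total / 2.0:
                return value
```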

It should be easy to cache these intermediate results in the Stack and PriorityHeap records, but I think we should discuss the approach before addressing this issue.
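
One possible shape for the caching — assuming the splitter has the child statistics at hand when it pushes a node — would be to extend the stack record with the cached fields. A hypothetical sketch (field names are illustrative, not the actual Cython struct layout):

```python
from dataclasses import dataclass
from typing import Optional

import numpy as np

@dataclass
class StackRecord:
    """Hypothetical tree-builder stack record extended with cached node
    statistics; illustrative only, not the actual Stack record."""
    start: int
    end: int
    depth: int
    # Cached statistics, computed during the parent's split and reused by
    # the child's Criterion.init instead of re-looping over the samples:
    sum_total: Optional[np.ndarray] = None
    weighted_n_node_samples: float = 0.0

# The parent's split evaluation already tracks sum_left/sum_right
# incrementally, so pushing the children together with their statistics
# would cost no extra pass over the data:
left = StackRecord(start=0, end=50, depth=1,
                   sum_total=np.array([30.0, 20.0]),
                   weighted_n_node_samples=50.0)
```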
