-
As you can see in the following snippet of code:

# %%
from sklearn.datasets import load_boston

X, y = load_boston(return_X_y=True)

# %%
from joblib import parallel_backend
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

n_estimators = 1_000
bagged_decision_trees = BaggingRegressor(
    DecisionTreeRegressor(), n_estimators=n_estimators, n_jobs=-1
)
random_forest = RandomForestRegressor(n_estimators=n_estimators, n_jobs=-1)

# %%
%%timeit
with parallel_backend('threading', n_jobs=2):
    bagged_decision_trees.fit(X, y)

# %%
%%timeit
with parallel_backend('loky', n_jobs=2):
    bagged_decision_trees.fit(X, y)

# %%
%%timeit
with parallel_backend('threading', n_jobs=2):
    random_forest.fit(X, y)

# %%
%%timeit
with parallel_backend('loky', n_jobs=2):
    random_forest.fit(X, y)
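The %%timeit cells above require an IPython session; as a minimal standalone sketch of the same comparison (an addition, not part of the original snippet, timing a single fit per configuration with timeit.timeit), one could run:

# Standalone sketch (an assumption, not from the original post): the same
# backend comparison without IPython cell magics, one fit per configuration.
import timeit

from joblib import parallel_backend
from sklearn.datasets import load_boston
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

X, y = load_boston(return_X_y=True)
estimators = {
    "bagged trees": BaggingRegressor(
        DecisionTreeRegressor(), n_estimators=1_000, n_jobs=-1
    ),
    "random forest": RandomForestRegressor(n_estimators=1_000, n_jobs=-1),
}

for backend in ("threading", "loky"):
    for name, est in estimators.items():
        with parallel_backend(backend, n_jobs=2):
            elapsed = timeit.timeit(lambda: est.fit(X, y), number=1)
        print(f"{name} / {backend}: {elapsed:.1f} s")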
-
On GNU/Linux, one can have a look at the way threads are scheduled and identify whether the GIL gets locked, using giltracer. Using this setup on this script:

# rf_debug.py
from sklearn.datasets import load_boston
from sklearn.ensemble import RandomForestRegressor


def main(args=None):
    X, y = load_boston(return_X_y=True)
    rf = RandomForestRegressor(n_estimators=1000, n_jobs=4)
    rf.fit(X, y)


if __name__ == "__main__":
    main()

by running:

giltracer --state-detect rf_debug.py

one gets several reports, which can be visualized with Perfetto.

Summary of the threads' interaction with the GIL: I have not been able to get results for every configuration. Zooming in on the kernel's scheduler profiling, it looks like some Python code prior to the Cython implementation takes the GIL.
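As a rough cross-check that does not require giltracer, one can compare sequential and thread-parallel fits of single-threaded estimators: if fit releases the GIL while building trees, the threaded run should scale. This is only a sketch (assuming scikit-learn and the standard library), not part of the measurements above:

# Sketch (an assumption, not from the original post): if RandomForestRegressor.fit
# releases the GIL while building trees, four fits in four threads should take
# roughly a quarter of the sequential wall time; if it is GIL-bound, the two
# timings will be similar.
import time
from concurrent.futures import ThreadPoolExecutor

from sklearn.datasets import load_boston
from sklearn.ensemble import RandomForestRegressor

X, y = load_boston(return_X_y=True)

def fit_one(_):
    RandomForestRegressor(n_estimators=250, n_jobs=1).fit(X, y)

start = time.perf_counter()
for i in range(4):
    fit_one(i)
sequential = time.perf_counter() - start

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    list(pool.map(fit_one, range(4)))
threaded = time.perf_counter() - start

print(f"sequential: {sequential:.2f} s, 4 threads: {threaded:.2f} s")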
-
I'm going to convert this discussion back to an issue so that we can investigate the core reason for the GIL not being released.
-
What kind of speedup should we expect from more CPUs with RandomForestRegressor? I seem to get a max of ~2x across 3 environments I've tested, up to 64 cores. I think it relates to the preference for threads:
scikit-learn/sklearn/ensemble/_forest.py, lines 381 to 393 in 2beed55
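The permalinked block is not reproduced here; as I understand it, the relevant point is that the trees are dispatched through joblib with prefer="threads", roughly like the following self-contained sketch (an approximation, not the permalinked source; build_tree stands in for the private tree-building helper in _forest.py):

# Self-contained sketch of the joblib pattern used for tree building
# (an approximation, not the permalinked source): the dispatch uses
# prefer="threads", so a thread-based backend is used unless overridden.
from joblib import Parallel, delayed
from sklearn.datasets import load_boston
from sklearn.tree import DecisionTreeRegressor

X, y = load_boston(return_X_y=True)

def build_tree(seed):
    # Stand-in for the private tree-building helper in _forest.py.
    return DecisionTreeRegressor(random_state=seed).fit(X, y)

trees = Parallel(n_jobs=4, prefer="threads")(
    delayed(build_tree)(seed) for seed in range(100)
)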
RandomForestRegressor (or ExtraTreesRegressor) shows a much smaller multi-core speedup than I would expect if CPU parallelization were being fully exploited.
For comparison, in every environment I get the expected multi-core speedup with other sklearn multiprocess calls (cross_val_score or BaggingRegressor both scale up ~proportionally in speed with n_jobs=-1 to the number of physical cores).

Mac:
Windows:
Test script
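The original test script is collapsed above; as a hypothetical reconstruction of what such a benchmark might look like (synthetic data via make_regression, so timings will not match the ones reported):

# Hypothetical benchmark sketch (not the collapsed original script): compare how
# RandomForestRegressor and BaggingRegressor scale with n_jobs.
import time

from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=20_000, n_features=50, random_state=0)

for n_jobs in (1, 2, 4, 8, -1):
    estimators = {
        "RandomForestRegressor": RandomForestRegressor(n_estimators=200, n_jobs=n_jobs),
        "BaggingRegressor": BaggingRegressor(
            DecisionTreeRegressor(), n_estimators=200, n_jobs=n_jobs
        ),
    }
    for name, est in estimators.items():
        start = time.perf_counter()
        est.fit(X, y)
        print(f"{name:24s} n_jobs={n_jobs:>2}: {time.perf_counter() - start:.1f} s")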