
Threads awaiting for GIL in Forest estimators #20666

Open
glemaitre opened this issue Aug 3, 2021 Discussed in #20651 · 6 comments
Labels
free-threading (PRs and issues related to support for free-threaded CPython, a.k.a. nogil or no-GIL, PEP 703), module:ensemble, Performance

Comments

@glemaitre (Member)

Discussed in #20651

In forest algorithms, the preferred parallelization backend is threading. However, it no longer appears to be the most appropriate backend. As discussed in that thread, it might be that the GIL is not explicitly released in some parts of the code, blocking the execution of the threads.

We need to investigate further to solve this issue.
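A minimal sketch of the kind of reproduction discussed in #20651 (the dataset size and number of trees below are illustrative assumptions, not the original script): if the GIL is held during tree fitting, the wall-clock time barely improves as `n_jobs` grows.

```python
# Illustrative reproduction sketch (assumed sizes, not the script from #20651):
# time forest fitting with the threading backend at different n_jobs values.
from time import perf_counter

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2_000, n_features=20, random_state=0)

for n_jobs in (1, 2, 4):
    clf = RandomForestClassifier(n_estimators=1_000, n_jobs=n_jobs, random_state=0)
    tic = perf_counter()
    clf.fit(X, y)
    print(f"n_jobs={n_jobs}: {perf_counter() - tic:.2f}s")
```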

@jjerphan (Member) commented Aug 3, 2021

From #20651 (reply in thread), two small optimizations could improve the scalability when using threads:

  • avoiding the redundant check_random_state call in BaseDecisionTree.fit, since the forest has already initialized the random state when bootstrapping the data for that specific tree;
  • avoiding the assert_all_finite call made by _check_sample_weight in BaseDecisionTree.fit (a rough timing of both calls is sketched below).
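
To make the per-tree cost concrete, here is an illustrative micro-timing of those two calls (the array size and tree count are assumptions, not measurements from the trace below); both run pure Python/NumPy code while holding the GIL, so their cost is paid serially once per tree:

```python
# Illustrative timing of the two per-tree validation calls mentioned above
# (n_samples and n_trees are assumptions, not numbers from the issue).
from time import perf_counter

import numpy as np
from sklearn.utils import check_random_state
from sklearn.utils.validation import _check_sample_weight

n_samples, n_trees = 100_000, 1_000
X = np.zeros((n_samples, 1))
sample_weight = np.ones(n_samples)

tic = perf_counter()
for _ in range(n_trees):
    check_random_state(0)                   # re-done in each BaseDecisionTree.fit
    _check_sample_weight(sample_weight, X)  # validates the weights (finiteness check included)
print(f"overhead for {n_trees} trees: {perf_counter() - tic:.2f}s")
```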

For reference, here is an extract of the sequential execution profiling:
[Screenshot: viztracer trace (160 MB JSON) viewed in the Perfetto UI, 2021-08-03]

@ogrisel (Member) commented Aug 4, 2021

I think threading is still the most appropriate backend by default. In #20651 the individual trees are very fast to train (4 ms) and many trees are fitted (1000), which is not representative of the typical use case.

If fitting each tree took 1 s or more, the GIL-holding segments of the code would be negligible and the thread-based scalability would be fine.
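
A rough sketch of that contrast (the sizes are assumptions chosen to illustrate the point, not benchmarks from this thread): many tiny trees versus fewer, more expensive trees, fitted with the same number of threads.

```python
# Illustrative contrast (assumed sizes): thread-based fitting scales much
# better when each tree represents a substantial amount of GIL-free work.
from time import perf_counter

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

configs = {
    "many tiny trees": dict(n_samples=500, n_estimators=1_000),
    "fewer larger trees": dict(n_samples=50_000, n_estimators=100),
}
for name, cfg in configs.items():
    X, y = make_classification(n_samples=cfg["n_samples"], n_features=20, random_state=0)
    timings = {}
    for n_jobs in (1, 4):
        clf = RandomForestClassifier(
            n_estimators=cfg["n_estimators"], n_jobs=n_jobs, random_state=0
        )
        tic = perf_counter()
        clf.fit(X, y)
        timings[n_jobs] = perf_counter() - tic
    print(f"{name}: speedup with 4 threads ~ {timings[1] / timings[4]:.1f}x")
```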

@brianbien commented Aug 4, 2021

> not representative

I agree, it's dataset dependent. Switching load_boston out for fetch_california in my example to get a larger training set, I see the expected speedup cross-platform.
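
The original snippet from #20651 is not reproduced here; below is a rough stand-in for the check described above, assuming fetch_california refers to sklearn.datasets.fetch_california_housing (the dataset is downloaded on first use).

```python
# Rough stand-in for the check described above (not the original snippet from
# #20651); assumes "fetch_california" means sklearn.datasets.fetch_california_housing.
from time import perf_counter

from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor

X, y = fetch_california_housing(return_X_y=True)  # ~20k samples, 8 features

for n_jobs in (1, 4):
    reg = RandomForestRegressor(n_estimators=100, n_jobs=n_jobs, random_state=0)
    tic = perf_counter()
    reg.fit(X, y)
    print(f"n_jobs={n_jobs}: {perf_counter() - tic:.1f}s")
```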

@Bhavay-2001

Hey @ogrisel @brianbien, could you please help me resolve this issue? I would like to contribute to it.

@ogrisel (Member) commented Nov 2, 2021

@Bhavay192 Feel free to have a look at the code of the random forests and the decision trees, and to open a PR that solves one item of #20666 (comment) at a time to keep the PR focused.

Feel free to ask specific questions on gitter if you need more help: https://gitter.im/scikit-learn/scikit-learn

ogrisel added the free-threading label (PRs and issues related to support for free-threaded CPython, a.k.a. nogil or no-GIL, PEP 703) on Oct 4, 2024
@ogrisel (Member) commented Oct 4, 2024

Before opening a PR for this issue, we should reevaluate with a free-threading build of scikit-learn on CPython 3.13.

Here are some resources to get started: https://py-free-threading.github.io/

Scikit-learn's CI also automatically publishes nightly wheels for cp313t, as numpy and scipy do (along with the other nightly wheels):

https://scikit-learn.org/stable/developers/advanced_installation.html#installing-nightly-builds
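
A small sketch for checking the interpreter before re-running the benchmarks; sys._is_gil_enabled() exists on CPython 3.13+, and the pip command in the comment is the nightly-wheel installation documented on the page linked above.

```python
# Sketch for verifying the interpreter before re-running the benchmarks.
# Nightly cp313t wheels can be installed as documented on the page linked above, e.g.:
#   pip install --pre --extra-index-url \
#       https://pypi.anaconda.org/scientific-python-nightly-wheels/simple scikit-learn
import sys
import sysconfig

# True when running a free-threaded (cp313t) build of CPython.
print("free-threaded build:", bool(sysconfig.get_config_var("Py_GIL_DISABLED")))

# On CPython 3.13+, reports whether the GIL is actually enabled at runtime
# (it can be re-enabled by extensions that do not declare free-threading support).
if hasattr(sys, "_is_gil_enabled"):
    print("GIL enabled at runtime:", sys._is_gil_enabled())
```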
