[WIP] ENH : Nearest-neighbors removal of unused stats computations on fit #13331

Closed
wants to merge 4 commits into from

rmenuet
Contributor

@rmenuet rmenuet commented Feb 28, 2019

Reference Issues/PRs

Solve #13330

What does this implement/fix? Explain your changes.

Deactivating some undocumented debugging stats improves the speedup obtained with n_jobs > 1 for tree-based nearest-neighbors algorithms (cf. the benchmark in the issue) and yields an almost linear performance increase with parallelism.

Any other comments?

I can notify the 2 repos that use those stats (https://github.com/fcrimins/py_idistance/blob/master/idist.py and https://github.com/wilseypa/dataAnalysis-scripts/blob/master/dataGeneration/GeneratorProject/ScikitSeqs.py) that those stats won't be computed anymore.

@jnothman
Member

Thanks for the pull request. Please give your PR a readable title that can be used as a commit message.

@agramfort agramfort changed the title from "Nn stats deprec" to "ENH : Nearest-neighbors removal of unused stats computations on fit" Mar 1, 2019
@rmenuet
Contributor Author

rmenuet commented Mar 1, 2019

As an additional benchmark: the impact of those stats is still noticeable, but less significant, when distances take longer to compute (no stats = this PR):

| tree | samples dimension | n_jobs | no stats | stats | ratio |
|------|-------------------|--------|----------|-------|-------|
| kd | 100 | 1 | 19.1s | 19.1s | 1.0 |
| kd | 100 | 4 | 5.3s | 10.1s | 1.9 |
| kd | 100 | 40 | 1.0s | 9.8s | 9.8 |
| kd | 1000 | 1 | 187.0s | 187.0s | 1.0 |
| kd | 1000 | 4 | 51.6s | 60.0s | 1.2 |
| kd | 1000 | 40 | 10.0s | 13.7s | 1.4 |
| ball | 100 | 1 | 11.0s | 11.0s | 1.0 |
| ball | 100 | 4 | 3.2s | 7.7s | 2.4 |
| ball | 100 | 40 | 1.0s | 3.8s | 3.8 |
| ball | 1000 | 1 | 120.0s | 120.0s | 1.0 |
| ball | 1000 | 4 | 31.8s | 32.7s | 1.0 |
| ball | 1000 | 40 | 3.8s | 6.0s | 1.6 |

(benchmark run on a dedicated server with timeit and 7 iterations)
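
The figures in the table above can also be read as speedups relative to n_jobs=1. The following is only a small arithmetic sketch over a few of the reported kd-tree timings; the helper names `ratio` and `speedup` are made up for illustration and are not part of the PR.

```python
# Reported timings in seconds, keyed by (tree, samples dimension, n_jobs).
# Values are (no stats, stats), copied from the kd-tree rows of the table.
timings = {
    ("kd", 100, 1): (19.1, 19.1),
    ("kd", 100, 4): (5.3, 10.1),
    ("kd", 100, 40): (1.0, 9.8),
}

NO_STATS, STATS = 0, 1  # column indices into the tuples above


def ratio(tree, dim, n_jobs):
    """The 'ratio' column: time with stats divided by time without stats."""
    no_stats, stats = timings[(tree, dim, n_jobs)]
    return stats / no_stats


def speedup(tree, dim, n_jobs, column):
    """Speedup over the n_jobs=1 run for the chosen column."""
    baseline = timings[(tree, dim, 1)][column]
    return baseline / timings[(tree, dim, n_jobs)][column]


if __name__ == "__main__":
    for n_jobs in (4, 40):
        print(
            n_jobs,
            round(ratio("kd", 100, n_jobs), 1),
            round(speedup("kd", 100, n_jobs, NO_STATS), 2),
            round(speedup("kd", 100, n_jobs, STATS), 2),
        )
```

Read this way, the kd-tree run at dimension 100 gains far more from going to 40 jobs without the stats than with them, which is the effect the PR targets.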

@rmenuet
Contributor Author

rmenuet commented Mar 1, 2019

Following #13330 (comment), I am working on updating this PR so that the stats can be reactivated with compilation arguments (in case anybody needs them in the future).

@rmenuet rmenuet changed the title ENH : Nearest-neighbors removal of unused stats computations on fit [WIP] ENH : Nearest-neighbors removal of unused stats computations on fit Mar 1, 2019
@amueller amueller added the Needs Benchmarks, Needs work, and Performance labels Aug 6, 2019
@amueller
Member

amueller commented Aug 6, 2019

Are you still working on this?

Base automatically changed from master to main January 22, 2021 10:50
@jjerphan
Member

I would even say: are you still working on this @rmenuet? 🙂

@jjerphan
Member

For your information, I ran some benchmarks to compare the runtime with and without those counters' increments.

The raw results are there

@thomasjpfan
Member

For this specific benchmark, I think we should fix the size of X with the random option and adjust n_jobs. The size of X would be based on the original issue:

    samples dimension: 100
    fit: 10k samples
    kneighbors: 10k samples
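
A benchmark along these lines could be sketched as follows. This is only an illustration assuming uniformly random data and `n_neighbors=5` (neither is specified in the thread); the function names `time_kneighbors` and `run_benchmark` are hypothetical and this is not the script used for the numbers reported here.

```python
import time

import numpy as np
from sklearn.neighbors import NearestNeighbors


def time_kneighbors(algorithm, n_jobs, n_samples=10_000, n_features=100, seed=0):
    """Return the wall-clock seconds spent in kneighbors for one configuration."""
    # Fixed seed so that every configuration sees the same X (assumption).
    rng = np.random.RandomState(seed)
    X_fit = rng.random_sample((n_samples, n_features))
    X_query = rng.random_sample((n_samples, n_features))

    # n_neighbors=5 is an arbitrary choice made for this sketch.
    nn = NearestNeighbors(n_neighbors=5, algorithm=algorithm, n_jobs=n_jobs)
    nn.fit(X_fit)

    start = time.perf_counter()
    nn.kneighbors(X_query)
    return time.perf_counter() - start


def run_benchmark(n_jobs_values=(1, 2, 4), **size_kwargs):
    """Time both tree algorithms for each n_jobs value, keeping X fixed."""
    results = {}
    for algorithm in ("kd_tree", "ball_tree"):
        for n_jobs in n_jobs_values:
            results[(algorithm, n_jobs)] = time_kneighbors(
                algorithm, n_jobs, **size_kwargs
            )
    return results
```

Calling `run_benchmark()` with the defaults uses the sizes listed above (10k fit samples, 10k query samples, dimension 100) and may take a while; running it once on `main` and once on this branch would give the with-stats and without-stats columns.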

Currently, n_calls is exposed through the public API get_n_calls, so changing the behavior of n_calls would be backward incompatible. I suspect this is not used often, so I would be okay with its removal in the long term.
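
For context, this is roughly how the counter is reachable from user code today. A minimal sketch, assuming a scikit-learn version in which `KDTree` still exposes `get_n_calls` and `reset_n_calls`; the data shape and `leaf_size` are arbitrary choices.

```python
import numpy as np
from sklearn.neighbors import KDTree

rng = np.random.RandomState(0)
X = rng.random_sample((1000, 3))

tree = KDTree(X, leaf_size=40)

# Start from a clean counter, then run a query.
tree.reset_n_calls()
calls_after_reset = tree.get_n_calls()

tree.query(X[:10], k=5)

# Number of distance computations recorded since the reset.
calls_after_query = tree.get_n_calls()
print(calls_after_reset, calls_after_query)
```

If the counters were no longer incremented, the second value would stop reflecting the query, which is the backward-incompatible change discussed here.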

There is no nice way to remove it with a deprecation. If we want to deprecate, we would need to:

  1. Add a compute_stats parameter to *Tree that warns that the default will change to False in 1.2.
  2. In 1.2, deprecate compute_stats, saying that stats will not be computed anymore and that this parameter will be removed in 1.4.
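
The first step of the cycle above could look roughly like the sketch below. `SketchTree` is a hypothetical pure-Python stand-in, not scikit-learn code, and the `"warn"` sentinel default is an assumption about how the warning would be triggered only when the user does not pass `compute_stats` explicitly.

```python
import warnings


class SketchTree:
    """Hypothetical stand-in for a *Tree class (illustration only)."""

    def __init__(self, data, compute_stats="warn"):
        if compute_stats == "warn":
            # Step 1: keep the old behavior but announce the future default.
            warnings.warn(
                "The default value of compute_stats will change from True "
                "to False in 1.2.",
                FutureWarning,
            )
            compute_stats = True
        self.data = list(data)
        self.compute_stats = compute_stats
        self.n_calls = 0

    def _dist(self, a, b):
        # The counter is only incremented when stats are requested.
        if self.compute_stats:
            self.n_calls += 1
        return abs(a - b)

    def nearest(self, x):
        """Brute-force nearest point; stands in for a real tree query."""
        return min(self.data, key=lambda p: self._dist(p, x))

    def get_n_calls(self):
        return self.n_calls
```

Passing `compute_stats=True` or `compute_stats=False` explicitly silences the warning. Step 2 would later replace this warning with one announcing the removal of the parameter itself.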

@rmenuet rmenuet closed this by deleting the head repository Jan 20, 2023

6 participants