Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ANN: Optimize host-side refine #1651

Merged
merged 5 commits into from
Jul 19, 2023

Conversation

achirkin
Copy link
Contributor

Prior to this change, raft's host-side implementation of raft::neighbors::refine operation uses non-optimal OpenMP thread config by default, spawning as many threads as there are available cores, even if only one thread is used (per-query parallelism with batch size one).
This change fixes that and adds a few optimizations alongside:

  • Use the number of threads = min(cores, queries)
  • Use templates to push the metric-type condition outside the distance computation loop (this should also make it easier to implement new metrics later)
  • Force tree-optimize compilation flag in the hopes compiles does the vectorization
  • Split out the host implementation in separate files to be able to compile it without nvcc.

@achirkin achirkin added 3 - Ready for Review improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Jul 18, 2023
@achirkin achirkin self-assigned this Jul 18, 2023
@achirkin achirkin added 2 - In Progress Currenty a work in progress and removed 3 - Ready for Review cpp CMake labels Jul 18, 2023
@achirkin achirkin marked this pull request as ready for review July 18, 2023 16:58
@achirkin achirkin requested review from a team as code owners July 18, 2023 16:58
@achirkin achirkin added 3 - Ready for Review and removed 2 - In Progress Currenty a work in progress labels Jul 18, 2023
cpp/CMakeLists.txt Outdated Show resolved Hide resolved
Copy link
Contributor

@tfeher tfeher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Artem for the PR, it looks good to me (apart from the issue with the unnecessary extra compile flag)!

@achirkin
Copy link
Contributor Author

achirkin commented Jul 19, 2023

On the performance note: the new code gives some ridiculous speed up of up to x500-x1000 in the case of small batches (n_query = 1), I'm not sure why. Also one cannot rely on the perf reporting in the current prims bench, because it uses cuda events to measure the time, while the host-only refine obviously does not use cuda streams (#1653). I will bring up the necessary updates to the benchmarks in a follow-up PR.

cpp/CMakeLists.txt Outdated Show resolved Hide resolved
@cjnolet
Copy link
Member

cjnolet commented Jul 19, 2023

/merge

@rapids-bot rapids-bot bot merged commit f0e75f2 into rapidsai:branch-23.08 Jul 19, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review CMake cpp improvement Improvement / enhancement to an existing function non-breaking Non-breaking change
Projects
Development

Successfully merging this pull request may close these issues.

4 participants