Improve multi-CTA algorithm #492

anaruse · 2024-11-25T10:46:04Z

It has been reported that when the number of search results is large, for example 100, using the multi-CTA algorithm can cause a decrease in recall. This PR is intended to alleviate this low recall issue.

close #208

copy-pr-bot · 2024-11-25T10:46:09Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

cjnolet · 2024-12-03T18:33:40Z

/ok to test

tfeher

Thanks @anaruse for this PR, it is great to see these improvements. Overall the changes look good, and the benchmarks that you have shared offline look very encouraging. I just have a few questions below.

cpp/src/neighbors/detail/cagra/search_plan.cuh

cpp/src/neighbors/detail/cagra/search_multi_cta_kernel-inl.cuh

cpp/src/neighbors/detail/cagra/search_plan.cuh

tfeher · 2024-12-05T00:18:12Z

@anaruse there are some unsigned commits, which blocks CI from testing the changes automatically. To fix this, could you rebase the PR?

cjnolet · 2024-12-05T05:31:33Z

/ok to test

…when the number of results is large

Fix some issues

Fix lower recall issue with new multi-cta algo

Removing redundant code and changing some parameters

Update cpp/src/neighbors/detail/cagra/search_plan.cuh

Co-authored-by: Tamas Bela Feher <[email protected]>

Remove an unnecessary line and satisfy clang-format

tfeher

Thanks Akira for the updates, the PR looks good to me.

cjnolet · 2024-12-05T13:12:33Z

/merge

cjnolet · 2024-12-05T13:13:04Z

/ok to test

tfeher · 2024-12-05T15:53:07Z

/ok to test

cjnolet · 2024-12-05T15:57:20Z

/merge

cjnolet · 2024-12-05T16:23:30Z

/ok to test

cjnolet · 2024-12-05T16:23:37Z

/merge

cjnolet · 2024-12-05T18:17:18Z

We are on the brink of missing code freeze for this PR. Please anyone reading this, don't click the "update" button. It inserts a merge commit, which reruns CI in its entirety and this is not needed to merge the PR. We can re-run individual flaky tests that fail without having to rerun the entire CI (the former takes minutes and the latter can take several hours).

cjnolet · 2024-12-06T05:05:22Z

@anaruse @tfeher CI seems to be running successfully for other PRs, but the gtests are consistently timing out for this PR. As far as I can tell, there are no updates to any of the tests in this PR, but the timeouts don't seem flaky; they seem isolated to these changes, somehow.

We are pushing back code freeze by 1 day. Do you guys think we can still make this in time for 24.12?

Handle the case when the search result contains invalid indices when building the updated graph in add_nodes. For debugging purposes, fail if any invalid indices found; in future, we can replace RAFT_FAIL with RAFT_LOG_WARN to make the add_nodes routine more robust.

achirkin · 2024-12-09T13:50:04Z

I took the liberty to add a workaround to add_nodes, which handles the case when CAGRA search doesn't return enough valid indices. With this, the tests should fail with a descriptive message in place of the segfault.
Once we find the source of the bug, we can relax the RAFT_FAIL to a RAFT_LOG_WARN.
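
As an illustration of the kind of check described above, here is a minimal host-side sketch. It is not the code added to add_nodes: the helper name `check_neighbor_indices` is hypothetical, a thrown exception stands in for RAFT_FAIL, and treating any index greater than or equal to the dataset size as invalid is an assumption.

```cpp
#include <cstddef>
#include <cstdint>
#include <sstream>
#include <stdexcept>
#include <vector>

// Hypothetical helper (not the cuVS implementation): verifies that every
// neighbor index returned by a search refers to an existing dataset row.
// Assumption: any index >= dataset_size counts as invalid.
// Throws with a descriptive message (standing in for RAFT_FAIL) so that a
// bad index is reported instead of being dereferenced later.
inline void check_neighbor_indices(const std::vector<std::uint32_t>& neighbors,
                                   std::size_t n_queries,
                                   std::size_t k,
                                   std::size_t dataset_size)
{
  if (neighbors.size() != n_queries * k) {
    throw std::invalid_argument("neighbors must hold n_queries * k entries");
  }
  for (std::size_t q = 0; q < n_queries; ++q) {
    for (std::size_t j = 0; j < k; ++j) {
      const std::uint32_t idx = neighbors[q * k + j];
      if (idx >= dataset_size) {
        std::ostringstream msg;
        msg << "invalid neighbor index " << idx << " for query " << q
            << " at position " << j << " (dataset size " << dataset_size << ")";
        throw std::runtime_error(msg.str());
      }
    }
  }
}
```

Replacing the `throw` on an invalid index with a logged warning and a skip of that entry would correspond to the more lenient behaviour mentioned as a possible follow-up.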

cjnolet · 2025-01-08T20:23:54Z

/ok to test

achirkin

@anaruse thank you for investigating this.
It's not new that our ANN algorithms may return invalid values under some circumstances. One example of this is IVF-PQ with a small number of probes (especially with filtering), so CAGRA multi-cta implementation won't be the first. I think it's reasonable to add a GTEST_SKIP() with an explanation comment for the case of multi-cta and dim = 1 and get this PR merged.

cpp/src/neighbors/detail/cagra/device_common.hpp

cpp/src/neighbors/detail/cagra/add_nodes.cuh

Co-authored-by: Artem M. Chirkin <[email protected]>

cjnolet · 2025-01-15T15:47:41Z

/ok to test

cjnolet · 2025-01-16T15:36:57Z

/ok to test

anaruse · 2025-01-17T06:45:24Z

Could you run the test? At least the issues related to data type have been fixed.

achirkin · 2025-01-17T06:47:46Z

/ok to test

anaruse · 2025-01-20T04:00:32Z

Although some CI tests failed, none of the failed tests seem to be related to this PR, or more accurately, to CAGRA. What do you think?

cjnolet · 2025-01-23T18:49:14Z

/ok to test

cjnolet · 2025-01-24T00:32:10Z

/ok to test

cjnolet · 2025-01-25T05:23:40Z

/ok to test

achirkin · 2025-01-27T16:52:01Z

/ok to test

cjnolet · 2025-01-28T16:51:18Z

@anaruse all of the C++ test checkers seem to be failing here, which indicates the test failures are likely related to your changes (and not just flaky tests). I also don't see these tests failing in other PRs.

anaruse · 2025-01-29T05:42:55Z

I can't tell from the logs which tests are failing, can you?

cjnolet · 2025-01-29T16:33:13Z

/ok to test

anaruse · 2025-01-29T17:15:00Z

It appears that an infinite loop was occurring in CAGRA_C_TEST, which caused the test to time out. I then found that an infinite loop can occur when the graph degree is small. I fixed it, and CAGRA_C_TEST should now be able to run.

achirkin · 2025-01-29T17:16:36Z

/ok to test

achirkin · 2025-01-30T08:02:53Z

/ok to test

This PR is based on #492. The new multi-CTA algorithm proposed in #492 can be used to obtain good recall even with high filtering rates. However, good recall cannot be obtained unless the number of search iterations, or itopk size, one of CAGRA's search parameters, is appropriately increased according to the filtering rate. Therefore, users need to find the appropriate itopk size according to the filtering rate by trial and error, which is a pain. This PR is intended to alleviate this problem by internally calculating the filtering rate and automatically adjusting the itopk size accordingly.

Authors:
- Akira Naruse (https://github.com/anaruse)
- Tamas Bela Feher (https://github.com/tfeher)
- Artem M. Chirkin (https://github.com/achirkin)
- Corey J. Nolet (https://github.com/cjnolet)

Approvers:
- Corey J. Nolet (https://github.com/cjnolet)

URL: #509
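
The idea in this commit message, estimating the filtering rate and enlarging itopk to match, could be sketched roughly as below. The thread does not say how #509 computes the adjustment, so the scaling rule here (divide by the fraction of rows that pass the filter, then clamp) is purely an assumption, and the names `filtering_rate`, `adjusted_itopk` and `max_itopk` are hypothetical.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Fraction of dataset rows rejected by the filter (true = row is allowed).
inline double filtering_rate(const std::vector<bool>& allowed)
{
  if (allowed.empty()) { return 0.0; }
  const auto n_allowed = std::count(allowed.begin(), allowed.end(), true);
  return 1.0 - static_cast<double>(n_allowed) / static_cast<double>(allowed.size());
}

// ASSUMED heuristic, not taken from #509: grow itopk in inverse proportion
// to the fraction of rows that survive the filter, so that roughly
// base_itopk unfiltered candidates remain, and clamp to max_itopk.
inline std::size_t adjusted_itopk(std::size_t base_itopk, double rate, std::size_t max_itopk)
{
  if (rate < 0.0 || rate > 1.0) {
    throw std::invalid_argument("filtering rate must be within [0, 1]");
  }
  if (rate >= 1.0) { return max_itopk; }  // everything is filtered out
  const double scaled = std::ceil(static_cast<double>(base_itopk) / (1.0 - rate));
  return static_cast<std::size_t>(std::min(scaled, static_cast<double>(max_itopk)));
}
```

With this rule, a user-supplied itopk of 64 and a filter that rejects three quarters of the rows would give an internal itopk of 256, unless the cap is lower.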

anaruse requested a review from a team as a code owner November 25, 2024 10:46

github-actions bot added the cpp label Nov 25, 2024

anaruse mentioned this pull request Nov 25, 2024

[FEA] Strongly filtered CAGRA #480

Open

tfeher added the improvement and non-breaking labels Nov 25, 2024

anaruse mentioned this pull request Dec 4, 2024

Automatic adjustment of itopk size according to filtering rate #509

Merged

tfeher requested changes Dec 5, 2024

anaruse force-pushed the improved_multi_cta_algo branch from 3dce160 to 6223fd2 Compare December 5, 2024 06:58

Merge branch 'branch-24.12' into improved_multi_cta_algo

8ff6991

tfeher approved these changes Dec 5, 2024

fix style

37e26c1

Merge branch 'branch-24.12' into improved_multi_cta_algo

3665d45

cjnolet assigned anaruse Dec 5, 2024

achirkin changed the base branch from branch-24.12 to branch-25.02 December 9, 2024 09:55

achirkin added 2 commits December 9, 2024 11:13

Merge branch 'branch-25.02' into improved_multi_cta_algo

018e792

achirkin approved these changes Jan 9, 2025

cpp/src/neighbors/detail/cagra/device_common.hpp (outdated)

cpp/src/neighbors/detail/cagra/add_nodes.cuh

Update cpp/src/neighbors/detail/cagra/device_common.hpp

192c0a9

Co-authored-by: Artem M. Chirkin <[email protected]>

Merge branch 'branch-25.02' into improved_multi_cta_algo

d19a6c4

anaruse added 2 commits January 17, 2025 13:30

Merge branch 'branch-25.02' into improved_multi_cta_algo

b5c31b3

Fixed data type issues

81e4b39

Merge branch 'branch-25.02' into improved_multi_cta_algo

cdc4bc4

Merge branch 'branch-25.02' into improved_multi_cta_algo

dd371dc

Merge branch 'branch-25.02' into improved_multi_cta_algo

e769ca7

Merge branch 'branch-25.02' into improved_multi_cta_algo

ce93427

Merge branch 'branch-25.02' into improved_multi_cta_algo

c133c8b

Fixed problem of infinite loop when graph degree is small

baa3c0c

Merge branch 'branch-25.02' into improved_multi_cta_algo

2e2b6bd

rapids-bot bot merged commit 836183e into rapidsai:branch-25.02 Jan 30, 2025
61 checks passed
