Describe the bug
Over at least the last day I've seen the NEIGHBORS_ANN_NN_DESCENT_TEST failing reproducibly and consistently in this CI job:
conda-cpp-tests / tests (arm64, 3.11, 12.0.1, ubuntu20.04, a100, latest, latest)
Like this:
/opt/conda/conda-bld/work/cpp/test/neighbors/ann_nn_descent/../ann_nn_descent.cuh:274: Failure
Value of: eval_neighbours(indices_naive, indices_NNDescent, distances_naive, distances_NNDescent, ps.n_rows, ps.graph_degree, 0.01, min_recall, true, static_cast<size_t>(ps.graph_degree * 0.1))
Actual: false (Duplicated index 1780 at k 30 for query 194! )
Expected: true
[ FAILED ] AnnNNDescentBatchTest/AnnNNDescentBatchTestF_U32.AnnNNDescentBatch/5, where GetParam() = dataset shape=4000x512, graph_degree=32, metric=0, host, clusters=3 (477 ms)
...
[----------] Global test environment tear-down
[==========] 344 tests from 4 test suites ran. (153504 ms total)
[ PASSED ] 343 tests.
[ FAILED ] 1 test, listed below:
[ FAILED ] AnnNNDescentBatchTest/AnnNNDescentBatchTestF_U32.AnnNNDescentBatch/5, where GetParam() = dataset shape=4000x512, graph_degree=32, metric=0, host, clusters=3
1 FAILED TEST
CMake Error at run_gpu_test.cmake:34 (execute_process):
execute_process failed command indexes:
1: "Child return code: 1"
96% tests passed, 1 tests failed out of 24
Total Test time (real) = 4868.12 sec
The following tests FAILED:
22 - NEIGHBORS_ANN_NN_DESCENT_TEST (Failed)
All C++ tests appear to pass in other conda-cpp-tests jobs (which are all x86_64).
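For anyone unfamiliar with the "Duplicated index ... at k ... for query ..." message in the log above, here is a minimal C++ sketch of the kind of per-row check it implies: each query's neighbor list should not contain the same index twice. This is not RAFT's eval_neighbours implementation. The names find_duplicate_neighbor and DuplicateHit are hypothetical, and the row-major n_rows x graph_degree layout is an assumption made for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <unordered_set>
#include <vector>

// Hypothetical result type (not part of RAFT): where the first duplicate was seen.
struct DuplicateHit {
  std::size_t query;    // row (query) whose neighbor list has the duplicate
  std::size_t k;        // position in that row where the repeat occurs
  std::uint32_t index;  // the repeated neighbor index
};

// Hypothetical helper (not RAFT's eval_neighbours): scans a row-major
// n_rows x graph_degree index matrix and returns the first position at which
// a row repeats an index it already contains, or std::nullopt if none does.
std::optional<DuplicateHit> find_duplicate_neighbor(const std::vector<std::uint32_t>& indices,
                                                    std::size_t n_rows,
                                                    std::size_t graph_degree)
{
  for (std::size_t q = 0; q < n_rows; ++q) {
    std::unordered_set<std::uint32_t> seen;  // indices already seen in this row
    for (std::size_t k = 0; k < graph_degree; ++k) {
      const std::uint32_t idx = indices[q * graph_degree + k];
      // insert() reports false in .second when idx was already present
      if (!seen.insert(idx).second) { return DuplicateHit{q, k, idx}; }
    }
  }
  return std::nullopt;
}
```

Under those assumptions, a hit of {query=194, k=30, index=1780} would correspond to the message in the failing test: index 1780 shows up a second time at position 30 of query 194's neighbor list.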
At https://github.com/rapidsai/raft/actions/workflows/pr.yaml, it looks like the most recent fully-passing run of the pr workflow that included the conda-cpp-tests jobs was 19 hours ago (build link).
I have not seen this be resolved by manual re-runs, so I don't think it's a flaky test. I think something has changed and that CI will be blocked until it's fixed.
Steps/Code to reproduce bug
Builds where I've seen that fail:
- Add missing cuda_suffixed: true #2440 (build link)
- bump NCCL floor to 2.18.1.1 #2443 (build link)
- last night's nightlies: https://github.com/rapidsai/raft/actions/runs/11009246395/job/30568482715
Most recent successful run:
- PR (19 hours ago): https://github.com/rapidsai/raft/actions/runs/11002806214/job/30604900635
- nightlies (25 hours ago): https://github.com/rapidsai/raft/actions/runs/10989906149/job/30509044364
Expected behavior
N/A
Environment details (please complete the following information):
N/A
Additional context
We did very recently update the version of fmt / spdlog across RAPIDS (#2433), but I don't have any evidence suggesting that that's the root cause.