[BUG] C++ testing failing on arm64: NEIGHBORS_ANN_NN_DESCENT_TEST #2450

Open
@jameslamb

Description

Describe the bug

Over at least the last day I've seen the NEIGHBORS_ANN_NN_DESCENT_TEST failing reproducibly and consistently in this CI job:

conda-cpp-tests / tests (arm64, 3.11, 12.0.1, ubuntu20.04, a100, latest, latest)

Like this:

/opt/conda/conda-bld/work/cpp/test/neighbors/ann_nn_descent/../ann_nn_descent.cuh:274: Failure
Value of: eval_neighbours(indices_naive, indices_NNDescent, distances_naive, distances_NNDescent, ps.n_rows, ps.graph_degree, 0.01, min_recall, true, static_cast<size_t>(ps.graph_degree * 0.1))
  Actual: false (Duplicated index 1780 at k 30 for query 194! )
Expected: true
[  FAILED  ] AnnNNDescentBatchTest/AnnNNDescentBatchTestF_U32.AnnNNDescentBatch/5, where GetParam() = dataset shape=4000x512, graph_degree=32, metric=0, host, clusters=3
 (477 ms)
...
[----------] Global test environment tear-down
[==========] 344 tests from 4 test suites ran. (153504 ms total)
[  PASSED  ] 343 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] AnnNNDescentBatchTest/AnnNNDescentBatchTestF_U32.AnnNNDescentBatch/5, where GetParam() = dataset shape=4000x512, graph_degree=32, metric=0, host, clusters=3

 1 FAILED TEST
CMake Error at run_gpu_test.cmake:34 (execute_process):
  execute_process failed command indexes:

    1: "Child return code: 1"

96% tests passed, 1 tests failed out of 24

Total Test time (real) = 4868.12 sec

The following tests FAILED:
	 22 - NEIGHBORS_ANN_NN_DESCENT_TEST (Failed)

All C++ tests appear to pass in other conda-cpp-tests jobs (which are all x86_64).
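For anyone triaging, the failure message ("Duplicated index 1780 at k 30 for query 194") reads as though a single row of the NN-Descent graph contains the same neighbor index twice. Below is a minimal sketch of that kind of per-row check, assuming a row-major `n_rows x degree` index matrix on the host. `DuplicateHit` and `find_duplicate_neighbor` are hypothetical names for illustration only; this is not RAFT's `eval_neighbours` implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <unordered_set>
#include <vector>

// Hypothetical helper type (not part of RAFT): where the first repeat was found.
struct DuplicateHit {
  std::size_t query;    // row of the graph (one row per query point)
  std::size_t k;        // column at which the repeated index was seen
  std::uint32_t index;  // the neighbor index that appeared twice in the row
};

// Scans a row-major n_rows x degree neighbor graph and returns the first
// position where a row repeats an index it already contains, or nullopt
// if every row holds distinct indices.
std::optional<DuplicateHit> find_duplicate_neighbor(const std::vector<std::uint32_t>& indices,
                                                    std::size_t n_rows,
                                                    std::size_t degree)
{
  assert(indices.size() == n_rows * degree);
  std::unordered_set<std::uint32_t> seen;
  for (std::size_t q = 0; q < n_rows; ++q) {
    seen.clear();  // duplicates are only checked within a single row
    for (std::size_t k = 0; k < degree; ++k) {
      const std::uint32_t idx = indices[q * degree + k];
      if (!seen.insert(idx).second) { return DuplicateHit{q, k, idx}; }
    }
  }
  return std::nullopt;
}
```

Running something like this over a copy of `indices_NNDescent` from the failing parameter set (4000 rows, graph degree 32) might help confirm whether the duplicates come from the graph itself rather than from the comparison logic.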

At https://github.com/rapidsai/raft/actions/workflows/pr.yaml, it looks like the most recent fully-passing run of the pr workflow that included the conda-cpp-tests jobs was 19 hours ago (build link).

Manual re-runs have not resolved this, so I don't think it's a flaky test. I think something has changed, and CI will be blocked until it's fixed.

Steps/Code to reproduce bug

Builds where I've seen that fail:

Most recent successful run:

Expected behavior

N/A

Environment details (please complete the following information):

N/A

Additional context

We very recently updated the version of fmt / spdlog across RAPIDS (#2433), but I don't have any evidence suggesting that it's the root cause.

