CUDA unit test failures & current workarounds #1743

@bmhan12

Tracking failures for CUDA: #1732 (review)
This is a list of unit tests that are failing, disabled, and/or have a workaround in place on matrix:

core_flatmap_serial

This unit test failure breaks down into two separate errors, (1) and (2).

(1) Error Description: One of the configurations tested uses std::string, which I think fails when it is used on device.
Status: Disabled for now. Configurations involving std::string are disabled for CUDA: https://github.com/LLNL/axom/blob/2a7af8675710293b8c26d293aae51f17c99323c0/src/axom/core/tests/core_flatmap.hpp#L750-L757

(2) Error Description: With a pinned memory policy, the batched insertion test on the flat map seems to result in either fewer insertions than expected or a deadlock.
Status: Disabled for now:
https://github.com/LLNL/axom/blob/2a7af8675710293b8c26d293aae51f17c99323c0/src/axom/core/tests/core_flatmap_for_all.hpp#L84-L92
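For error (1), one way to keep host-only types such as std::string out of device test configurations is a compile-time check on the key and value types. The following is a minimal sketch, not the mechanism used in core_flatmap.hpp; the name `is_device_friendly_pair` is hypothetical, and trivial copyability is only a rough proxy for "safe to use on device".

```cpp
#include <string>
#include <type_traits>

// Hypothetical helper (not part of Axom): reports whether both the key and
// value types are trivially copyable. This is a heuristic stand-in for
// "can be copied to and used on device"; it is an assumption, not a guarantee.
template <typename Key, typename Value>
constexpr bool is_device_friendly_pair()
{
  return std::is_trivially_copyable<Key>::value &&
    std::is_trivially_copyable<Value>::value;
}

// Plain arithmetic types pass the check.
static_assert(is_device_friendly_pair<int, double>(),
              "int/double should be usable in a device configuration");

// std::string manages heap memory on the host, so it fails the check and
// would be filtered out of a device configuration.
static_assert(!is_device_friendly_pair<int, std::string>(),
              "std::string should be excluded from device configurations");
```

A test list could use such a check to skip configurations at compile time instead of disabling them with a preprocessor guard.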




numerics_quadrature_serial

Error Description: I think this error results from quadrature.cpp not being fully ported for device-only policies. For example, in the compute_gauss_legendre_data function, axom::Array objects are allocated on device but then accessed on the host:
https://github.com/LLNL/axom/blob/df7fef005ffb2c40284ef22d0be789304ab51935/src/axom/core/numerics/quadrature.cpp#L44-L52

Status: Workaround for now. Use unified memory for testing: https://github.com/LLNL/axom/blob/2a7af8675710293b8c26d293aae51f17c99323c0/src/axom/core/tests/numerics_quadrature.hpp#L100-L111
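
An alternative to unified memory would be to compute the rule in host memory first and copy it to the target memory space afterward. The sketch below illustrates that pattern with standard C++ only; it is not the code in quadrature.cpp, and `QuadratureRule`, `gauss_legendre_host`, and `upload_rule` are hypothetical names. Nodes and weights are computed on [-1, 1] with Newton's method, which may differ from the interval and algorithm Axom uses.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Hypothetical container for a rule held in host memory.
struct QuadratureRule
{
  std::vector<double> nodes;
  std::vector<double> weights;
};

// Computes an n-point Gauss-Legendre rule on [-1, 1] entirely on the host.
// Assumes n >= 1.
QuadratureRule gauss_legendre_host(int n)
{
  QuadratureRule rule;
  rule.nodes.resize(n);
  rule.weights.resize(n);
  const double pi = std::acos(-1.0);

  for(int i = 0; i < n; ++i)
  {
    // Standard initial guess for the i-th root of the Legendre polynomial P_n.
    double x = std::cos(pi * (i + 0.75) / (n + 0.5));
    double dp = 1.0;

    for(int it = 0; it < 100; ++it)
    {
      // Evaluate P_n(x) (p1) and P_{n-1}(x) (p0) with the three-term recurrence.
      double p0 = 1.0;
      double p1 = x;
      for(int k = 2; k <= n; ++k)
      {
        const double pk = ((2.0 * k - 1.0) * x * p1 - (k - 1.0) * p0) / k;
        p0 = p1;
        p1 = pk;
      }

      // Derivative P_n'(x), then one Newton step.
      dp = n * (x * p1 - p0) / (x * x - 1.0);
      const double dx = p1 / dp;
      x -= dx;
      if(std::abs(dx) < 1e-15)
      {
        break;
      }
    }

    rule.nodes[i] = x;
    rule.weights[i] = 2.0 / ((1.0 - x * x) * dp * dp);
  }
  return rule;
}

// Copies a host rule into caller-provided buffers. Here the destination is
// ordinary host memory; in a CUDA build this step would be replaced by a
// host-to-device copy such as cudaMemcpy (an assumption, not shown).
void upload_rule(const QuadratureRule& host, double* dst_nodes, double* dst_weights)
{
  std::copy(host.nodes.begin(), host.nodes.end(), dst_nodes);
  std::copy(host.weights.begin(), host.weights.end(), dst_weights);
}
```

With this split, the host never dereferences device pointers: all arithmetic happens in `gauss_legendre_host`, and only `upload_rule` touches the destination buffers.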




bump_cutfield
bump_topology_mapper
mir_coupled3d
mir_equiz2d (passing with workaround)
mir_equiz3d
mir_concentric_circles_cuda (passing with workaround)
mir_tutorial_simple_cuda_2 (passing with workaround)
mir_tutorial_simple_cuda_5 (passing with workaround)

Error Description: Notably, a subset of these tests (mir_equiz*, bump_topology_mapper) also fails intermittently with HIP, so the whole set of failures may be related.
Status: Some are still failing. Those marked "passing with workaround" pass after a unified memory workaround was added to conduit_memory.hpp: https://github.com/LLNL/axom/blob/2a7af8675710293b8c26d293aae51f17c99323c0/src/axom/bump/utilities/conduit_memory.hpp#L101-L108

The others fail with a message suggesting an issue with reading the baseline files from conduit: #1732 (comment)
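
The unified memory workaround mentioned above can be pictured as a small policy that redirects device-only requests to unified memory when a flag is set. This is an illustrative sketch only; `MemorySpace`, `BuildConfig`, and `resolve_space` are hypothetical names and do not reflect the actual contents of conduit_memory.hpp.

```cpp
// Hypothetical memory spaces a caller might request.
enum class MemorySpace
{
  Host,
  Device,
  Unified
};

// Hypothetical build/runtime switches (assumptions for illustration).
struct BuildConfig
{
  bool cuda_enabled;            // true when compiled for CUDA
  bool use_unified_workaround;  // true while the workaround is active
};

// Returns the space to actually allocate in. Device requests are redirected
// to unified memory only when both switches are on; all other requests are
// returned unchanged.
constexpr MemorySpace resolve_space(MemorySpace requested, BuildConfig cfg)
{
  return (cfg.cuda_enabled && cfg.use_unified_workaround &&
          requested == MemorySpace::Device)
    ? MemorySpace::Unified
    : requested;
}
```

Keeping the redirect in one function would make the workaround easy to remove once the underlying failures are understood.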

Labels: Reviewed, Testing, bug, cuda