PnetCDF test leads to margo error, which leads to hang in ROMIO #783

Open
@adammoody

Description

While running a particular PnetCDF test

https://github.com/Parallel-NetCDF/PnetCDF/blob/master/test/largefile/high_dim_var.c

with 4 ranks on 2 nodes, a read from rank 2 triggers a failure on the server, which generates the following log:

2023-07-06T16:00:56 tid=872735 @ signal_new_requests() [unifyfs_request_manager.c:269] signaling new requests
2023-07-06T16:00:56 tid=873012 @ request_manager_thread() [unifyfs_request_manager.c:1802] RM[1511587981:1] got work
2023-07-06T16:00:56 tid=873012 @ rm_process_client_requests() [unifyfs_request_manager.c:1631] processing 1 client requests
2023-07-06T16:00:56 tid=873012 @ process_read_rpc() [unifyfs_request_manager.c:1324] processing mread[0] with 1 requests
2023-07-06T16:00:56 tid=873012 @ submit_read_request() [unifyfs_fops_rpc.c:252] handling read request (1 extents)
2023-07-06T16:00:56 tid=873012 @ pull_margo_bulk_buffer() [../../common/src/unifyfs_rpc_util.c:179] margo_bulk_transfer(buf_offset=0, len=1572864) failed
2023-07-06T16:00:56 tid=873012 @ pull_margo_bulk_buffer() [../../common/src/unifyfs_rpc_util.c:197] failed bulk transfer - transferred 0 of 1572864 bytes
2023-07-06T16:00:56 tid=873012 @ unifyfs_invoke_find_extents_rpc() [unifyfs_p2p_rpc.c:665] failed to get bulk chunk locations
2023-07-06T16:00:56 tid=873012 @ submit_read_request() [unifyfs_fops_rpc.c:279] failed to find extent locations
2023-07-06T16:00:56 tid=873012 @ process_read_rpc() [unifyfs_request_manager.c:1333] unifyfs_fops_read() failed
2023-07-06T16:00:56 tid=873012 @ rm_process_client_requests() [unifyfs_request_manager.c:1690] client rpc request 0 failed ("Mercury/Argobots operation error")
2023-07-06T16:00:56 tid=873012 @ request_manager_thread() [unifyfs_request_manager.c:1768] failed to process client rpc requests
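
For triage, the failure chain in a log like the one above can be pulled out mechanically. The following is a minimal sketch that assumes every line follows the layout seen here (timestamp, `tid=`, function name, `[file:line]`, message). The names `parse_log_line` and `failure_chain` are hypothetical helpers for illustration and are not part of UnifyFS.

```python
import re

# Assumed line layout, taken from the log excerpt above:
#   <timestamp> tid=<id> @ <function>() [<file>:<line>] <message>
LOG_RE = re.compile(
    r"^(?P<ts>\S+) tid=(?P<tid>\d+) @ (?P<func>\w+)\(\) "
    r"\[(?P<file>[^\]:]+):(?P<line>\d+)\] (?P<msg>.*)$"
)


def parse_log_line(text):
    """Split one log line into its fields; return None if it does not match."""
    m = LOG_RE.match(text.strip())
    if m is None:
        return None
    rec = m.groupdict()
    rec["tid"] = int(rec["tid"])
    rec["line"] = int(rec["line"])
    return rec


def failure_chain(lines):
    """Return (function, file, line) for each entry whose message mentions a failure."""
    chain = []
    for text in lines:
        rec = parse_log_line(text)
        if rec is not None and "fail" in rec["msg"].lower():
            chain.append((rec["func"], rec["file"], rec["line"]))
    return chain


# Three lines copied from the log above.
sample = [
    "2023-07-06T16:00:56 tid=873012 @ submit_read_request() [unifyfs_fops_rpc.c:252] handling read request (1 extents)",
    "2023-07-06T16:00:56 tid=873012 @ pull_margo_bulk_buffer() [../../common/src/unifyfs_rpc_util.c:179] margo_bulk_transfer(buf_offset=0, len=1572864) failed",
    "2023-07-06T16:00:56 tid=873012 @ pull_margo_bulk_buffer() [../../common/src/unifyfs_rpc_util.c:197] failed bulk transfer - transferred 0 of 1572864 bytes",
]

for func, path, lineno in failure_chain(sample):
    print(f"{func} ({path}:{lineno})")
```

Applied to the full log, this lists the failing call sites in the order they were logged, starting at `pull_margo_bulk_buffer()`, which makes it easier to see that the bulk transfer is the first thing to fail.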

The error code returned to the client for the read is 1004. That probably corresponds to one of these:

https://github.com/mercury-hpc/mercury/blob/55b95f72714bb0e4e0deeedf4fd78d116ea9476a/src/mercury_core_types.h#L102-L108

The read error happens during MPI_File_read_at_all, which then leads to a deadlock in ROMIO:
pmodels/mpich#6585
