While running a particular margo test
https://github.com/Parallel-NetCDF/PnetCDF/blob/master/test/largefile/high_dim_var.c
with 4 ranks on 2 nodes, a read from rank 2 triggers a failure on the server, which produces the following log output:
2023-07-06T16:00:56 tid=872735 @ signal_new_requests() [unifyfs_request_manager.c:269] signaling new requests
2023-07-06T16:00:56 tid=873012 @ request_manager_thread() [unifyfs_request_manager.c:1802] RM[1511587981:1] got work
2023-07-06T16:00:56 tid=873012 @ rm_process_client_requests() [unifyfs_request_manager.c:1631] processing 1 client requests
2023-07-06T16:00:56 tid=873012 @ process_read_rpc() [unifyfs_request_manager.c:1324] processing mread[0] with 1 requests
2023-07-06T16:00:56 tid=873012 @ submit_read_request() [unifyfs_fops_rpc.c:252] handling read request (1 extents)
2023-07-06T16:00:56 tid=873012 @ pull_margo_bulk_buffer() [../../common/src/unifyfs_rpc_util.c:179] margo_bulk_transfer(buf_offset=0, len=1572864) failed
2023-07-06T16:00:56 tid=873012 @ pull_margo_bulk_buffer() [../../common/src/unifyfs_rpc_util.c:197] failed bulk transfer - transferred 0 of 1572864 bytes
2023-07-06T16:00:56 tid=873012 @ unifyfs_invoke_find_extents_rpc() [unifyfs_p2p_rpc.c:665] failed to get bulk chunk locations
2023-07-06T16:00:56 tid=873012 @ submit_read_request() [unifyfs_fops_rpc.c:279] failed to find extent locations
2023-07-06T16:00:56 tid=873012 @ process_read_rpc() [unifyfs_request_manager.c:1333] unifyfs_fops_read() failed
2023-07-06T16:00:56 tid=873012 @ rm_process_client_requests() [unifyfs_request_manager.c:1690] client rpc request 0 failed ("Mercury/Argobots operation error")
2023-07-06T16:00:56 tid=873012 @ request_manager_thread() [unifyfs_request_manager.c:1768] failed to process client rpc requests
The error code returned to the client for the read is 1004. That probably corresponds to one of these:
The read error happens during MPI_File_read_at_all
which then leads to a deadlock in ROMIO:
pmodels/mpich#6585