torch.distributed.DistBackendError: NCCL error #715

Open
Chevolier opened this issue Nov 19, 2024 · 5 comments

Comments

@Chevolier

I ran into a rather odd issue. I am using 2 p4d.24xlarge instances (8xA100 each) on AWS to train my model. The bash script first downloads the data, and only after the download finishes does it start training by running:

torchrun ${DISTRIBUTED_ARGS} ${WORKING_DIR}/dlrm_main.py --print_sharding_plan --model_type dnn
--epochs 1 --embedding_dim 16 --batch_size 8192 --learning_rate 0.006 --adagrad --num_embeddings 1000000000
--binary_path $binary_path --training_days 14 --valid_hour 23/00
--test_hour 23/00 --num_workers 4 --prefetch_factor 8 --save_dir $SM_WORKING_DIR

When the data downloading process takes more than 20 min, the training fails with the following error:

2024-11-15T06:01:13.169Z
Traceback (most recent call last): File "/opt/ml/code/dlrm_main.py", line 954, in
2024-11-15T06:01:13.169Z
Traceback (most recent call last): File "/opt/ml/code/dlrm_main.py", line 954, in
2024-11-15T06:01:13.169Z
invoke_main() invoke_main()
2024-11-15T06:01:13.169Z
File "/opt/ml/code/dlrm_main.py", line 951, in invoke_main File "/opt/ml/code/dlrm_main.py", line 951, in invoke_main
2024-11-15T06:01:13.169Z
main(sys.argv[1:])
2024-11-15T06:01:13.169Z
File "/opt/ml/code/dlrm_main.py", line 760, in main
2024-11-15T06:01:13.169Z
main(sys.argv[1:])
2024-11-15T06:01:13.170Z
dist.init_process_group(backend=backend, timeout=timeout_duration, device_id=device) File "/opt/ml/code/dlrm_main.py", line 760, in main
2024-11-15T06:01:13.170Z
File "/opt/conda/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper return func(*args, **kwargs)
2024-11-15T06:01:13.170Z
^^^^^^^^^^^^^^^^^^^^^ File "/opt/conda/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 97, in wrapper
2024-11-15T06:01:13.170Z
func_return = func(*args, **kwargs) ^^dist.init_process_group(backend=backend, timeout=timeout_duration, device_id=device)^
2024-11-15T06:01:13.170Z
^^^^^^^^^^^^^
2024-11-15T06:01:13.170Z
^^^^^ File "/opt/conda/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 1527, in init_process_group
2024-11-15T06:01:13.170Z
File "/opt/conda/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper return func(*args, **kwargs)
2024-11-15T06:01:13.170Z
^^^^^^^^^^^
2024-11-15T06:01:13.170Z
^^^^^^^^^^ File "/opt/conda/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 97, in wrapper default_pg, _ = _new_process_group_helper(
2024-11-15T06:01:13.170Z
^^^^^^^^^^^^^ ^func_return = func(*args, **kwargs)^^
2024-11-15T06:01:13.170Z
^^^^^^^^^^
2024-11-15T06:01:13.170Z
File "/opt/conda/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 1867, in _new_process_group_helper ^^^^^^^^^^^^^^^^^^^^^ File "/opt/conda/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 1527, in init_process_group
2024-11-15T06:01:13.170Z
eager_backend.eager_connect_single_device(device_id)
2024-11-15T06:01:13.170Z
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/NCCLUtils.hpp:317, internal error - please report this issue to the NCCL developers, NCCL version 2.21.5
2024-11-15T06:01:13.170Z
ncclInternalError: Internal check failed.
2024-11-15T06:01:13.170Z
Last error:
2024-11-15T06:01:13.170Z
NET/OFI Couldn't open CQ. RC: -22, ERROR: Invalid argument
2024-11-15T06:01:13.170Z
default_pg, _ = _new_process_group_helper(
2024-11-15T06:01:13.170Z
^^^^^^^^^^^^^^^^^^^^^^^^
2024-11-15T06:01:13.170Z
^^ File "/opt/conda/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 1867, in _new_process_group_helper
2024-11-15T06:01:13.170Z
eager_backend.eager_connect_single_device(device_id)
2024-11-15T06:01:13.170Z
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/NCCLUtils.hpp:317, internal error - please report this issue to the NCCL developers, NCCL version 2.21.5
2024-11-15T06:01:13.170Z
ncclInternalError: Internal check failed.
2024-11-15T06:01:13.170Z
Last error:
2024-11-15T06:01:13.170Z
NET/OFI Error accessing endpoint. Endpoint has not been initialized.

torch version: 2.5.0
cuda version: 12.4

It seems that communication between GPUs on different nodes fails after 20 min or more, counting all initialization time. I also tested with less data (so that downloading takes less than 20 min), and training runs without problems. A single node with more data also works fine. Please help, thanks a lot!
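Because the traceback above interleaves output from more than one rank, it can help to pull out the earliest plugin-level message. The following is a minimal sketch, assuming the log is plain text with one message per line. The function name `first_matching_line` and the substrings it searches for are illustrative choices, not part of NCCL or PyTorch.

```python
def first_matching_line(lines, needles=("NET/OFI", "libfabric:")):
    """Return (index, stripped line) of the first line containing any needle.

    Returns None when no line matches. The needles are hypothetical
    defaults chosen to match the messages quoted in this issue.
    """
    for index, line in enumerate(lines):
        if any(needle in line for needle in needles):
            return index, line.strip()
    return None


# Message lines copied from the traceback above (timestamps omitted).
sample = [
    "ncclInternalError: Internal check failed.",
    "Last error:",
    "NET/OFI Couldn't open CQ. RC: -22, ERROR: Invalid argument",
    "NET/OFI Error accessing endpoint. Endpoint has not been initialized.",
]

if __name__ == "__main__":
    print(first_matching_line(sample))
```

With `NCCL_DEBUG` raised to info level, informational `NET/OFI` lines would match as well, so the needles would likely need narrowing (for example to `NCCL WARN`).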

@rauteric
Contributor

Hi. This appears to be the first error:

NET/OFI Couldn't open CQ. RC: -22, ERROR: Invalid argument

It would be helpful to get a full log with NCCL_DEBUG=Info and FI_LOG_LEVEL=Warn set so we can get additional information.
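One way to set the two variables is to build the environment in Python before launching the job. This is only a sketch: the helper name `with_debug_logging` is hypothetical, and it assumes the launcher passes the environment on to every worker process on every node. The values are copied as written above.

```python
import os


def with_debug_logging(base_env=None):
    """Return a copy of the environment with the two requested variables set.

    base_env defaults to the current process environment. The helper is
    illustrative and not part of NCCL, libfabric, or PyTorch.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["NCCL_DEBUG"] = "Info"    # NCCL log verbosity, as requested above
    env["FI_LOG_LEVEL"] = "Warn"  # libfabric log verbosity, as requested above
    return env


if __name__ == "__main__":
    env = with_debug_logging()
    # Show only the two variables this helper touches.
    print({key: env[key] for key in ("NCCL_DEBUG", "FI_LOG_LEVEL")})
```

The returned dictionary could then be passed as `env=` to `subprocess.run` when starting `torchrun`. Exporting the same two variables in the launch script before the `torchrun` line should have the same effect.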

@Chevolier
Author

Thanks for your time. The following is the output after setting the two environment variables above.

Requirement already satisfied: pyasn1-modules>=0.2.1 in /opt/conda/lib/python3.11/site-packages (from google-auth~=2.0->databricks-sdk<1,>=0.20.0->mlflow-skinny==2.17.0->mlflow>=2.8->sagemaker-mlflow->sagemaker->-r requirements.txt (line 8)) (0.4.1)

  | 2024-11-28T07:41:34.516Z | Requirement already satisfied: rsa<5,>=3.1.4 in /opt/conda/lib/python3.11/site-packages (from google-auth~=2.0->databricks-sdk<1,>=0.20.0->mlflow-skinny==2.17.0->mlflow>=2.8->sagemaker-mlflow->sagemaker->-r requirements.txt (line 8)) (4.7.2)
  | 2024-11-28T07:41:34.516Z | Requirement already satisfied: pyasn1<0.7.0,>=0.4.6 in /opt/conda/lib/python3.11/site-packages (from pyasn1-modules>=0.2.1->google-auth~=2.0->databricks-sdk<1,>=0.20.0->mlflow-skinny==2.17.0->mlflow>=2.8->sagemaker-mlflow->sagemaker->-r requirements.txt (line 8)) (0.6.1)
  | 2024-11-28T07:41:35.516Z | WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager, possibly rendering your system unusable.It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv. Use the --root-user-action option if you know what you are doing and want to suppress this warning.
  | 2024-11-28T07:41:37.517Z | W1128 07:41:37.335000 17676 site-packages/torch/distributed/run.py:793]
  | 2024-11-28T07:41:37.517Z | W1128 07:41:37.335000 17676 site-packages/torch/distributed/run.py:793] *****************************************
  | 2024-11-28T07:41:37.517Z | W1128 07:41:37.335000 17676 site-packages/torch/distributed/run.py:793] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
  | 2024-11-28T07:41:37.517Z | W1128 07:41:37.335000 17676 site-packages/torch/distributed/run.py:793] *****************************************
  | 2024-11-28T07:42:10.524Z | PARAMS: (lr, batch_size, warmup_steps, decay_start, decay_steps): (0.0014, 8192, 0, 0, 0)
  | 2024-11-28T07:42:10.524Z | [rank4]:[W1128 07:42:10.724121136 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 4] using GPU 4 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
  | 2024-11-28T07:42:10.524Z | [rank3]:[W1128 07:42:10.188455846 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 3] using GPU 3 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
  | 2024-11-28T07:42:11.525Z | [rank2]:[W1128 07:42:10.302589390 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 2] using GPU 2 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
  | 2024-11-28T07:42:11.525Z | [rank7]:[W1128 07:42:10.361934716 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 7] using GPU 7 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
  | 2024-11-28T07:42:11.525Z | [rank5]:[W1128 07:42:10.421252997 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 5] using GPU 5 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
  | 2024-11-28T07:42:11.525Z | [rank0]:[W1128 07:42:10.423724985 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 0] using GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
  | 2024-11-28T07:42:11.525Z | [rank6]:[W1128 07:42:10.432511628 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 6] using GPU 6 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
  | 2024-11-28T07:42:11.525Z | algo-1:17773:17773 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
  | 2024-11-28T07:42:11.525Z | algo-1:17773:17773 [0] NCCL INFO Bootstrap : Using eth0:10.0.227.37<0>
  | 2024-11-28T07:42:11.525Z | [rank1]:[W1128 07:42:10.446575498 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 1] using GPU 1 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
  | 2024-11-28T07:42:11.525Z | algo-1:17773:17773 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v8 symbol.
  | 2024-11-28T07:42:11.525Z | algo-1:17773:17773 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
  | 2024-11-28T07:42:11.525Z | algo-1:17773:17773 [0] NCCL INFO cudaDriverVersion 12040
  | 2024-11-28T07:42:11.525Z | NCCL version 2.21.5+cuda12.4
  | 2024-11-28T07:42:11.525Z | algo-1:17775:17775 [2] NCCL INFO cudaDriverVersion 12040
  | 2024-11-28T07:42:11.525Z | algo-1:17775:17775 [2] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
  | 2024-11-28T07:42:11.525Z | algo-1:17775:17775 [2] NCCL INFO Bootstrap : Using eth0:10.0.227.37<0>
  | 2024-11-28T07:42:11.525Z | algo-1:17775:17775 [2] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v8 symbol.
  | 2024-11-28T07:42:11.525Z | algo-1:17775:17775 [2] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
  | 2024-11-28T07:42:11.525Z | algo-1:17778:17778 [5] NCCL INFO cudaDriverVersion 12040
  | 2024-11-28T07:42:11.525Z | algo-1:17778:17778 [5] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
  | 2024-11-28T07:42:11.525Z | algo-1:17778:17778 [5] NCCL INFO Bootstrap : Using eth0:10.0.227.37<0>
  | 2024-11-28T07:42:11.525Z | algo-1:17777:17777 [4] NCCL INFO cudaDriverVersion 12040
  | 2024-11-28T07:42:11.525Z | algo-1:17777:17777 [4] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
  | 2024-11-28T07:42:11.525Z | algo-1:17777:17777 [4] NCCL INFO Bootstrap : Using eth0:10.0.227.37<0>
  | 2024-11-28T07:42:11.525Z | algo-1:17778:17778 [5] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v8 symbol.
  | 2024-11-28T07:42:11.525Z | algo-1:17778:17778 [5] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
  | 2024-11-28T07:42:11.525Z | algo-1:17777:17777 [4] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v8 symbol.
  | 2024-11-28T07:42:11.525Z | algo-1:17777:17777 [4] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
  | 2024-11-28T07:42:11.525Z | algo-1:17776:17776 [3] NCCL INFO cudaDriverVersion 12040
  | 2024-11-28T07:42:11.525Z | algo-1:17776:17776 [3] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
  | 2024-11-28T07:42:11.525Z | algo-1:17776:17776 [3] NCCL INFO Bootstrap : Using eth0:10.0.227.37<0>
  | 2024-11-28T07:42:11.525Z | algo-1:17780:17780 [7] NCCL INFO cudaDriverVersion 12040
  | 2024-11-28T07:42:11.525Z | algo-1:17780:17780 [7] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
  | 2024-11-28T07:42:11.525Z | algo-1:17780:17780 [7] NCCL INFO Bootstrap : Using eth0:10.0.227.37<0>
  | 2024-11-28T07:42:11.525Z | algo-1:17776:17776 [3] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v8 symbol.
  | 2024-11-28T07:42:11.525Z | algo-1:17776:17776 [3] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
  | 2024-11-28T07:42:11.525Z | algo-1:17779:17779 [6] NCCL INFO cudaDriverVersion 12040
  | 2024-11-28T07:42:11.525Z | algo-1:17779:17779 [6] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
  | 2024-11-28T07:42:11.525Z | algo-1:17779:17779 [6] NCCL INFO Bootstrap : Using eth0:10.0.227.37<0>
  | 2024-11-28T07:42:11.525Z | algo-1:17774:17774 [1] NCCL INFO cudaDriverVersion 12040
  | 2024-11-28T07:42:11.525Z | algo-1:17774:17774 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
  | 2024-11-28T07:42:11.525Z | algo-1:17780:17780 [7] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v8 symbol.
  | 2024-11-28T07:42:11.525Z | algo-1:17780:17780 [7] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
  | 2024-11-28T07:42:11.525Z | algo-1:17774:17774 [1] NCCL INFO Bootstrap : Using eth0:10.0.227.37<0>
  | 2024-11-28T07:42:11.525Z | algo-1:17779:17779 [6] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v8 symbol.
  | 2024-11-28T07:42:11.526Z | algo-1:17779:17779 [6] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
  | 2024-11-28T07:42:11.526Z | algo-1:17774:17774 [1] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v8 symbol.
  | 2024-11-28T07:42:11.526Z | algo-1:17774:17774 [1] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
  | 2024-11-28T07:42:11.526Z | algo-1:17773:17846 [0] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.11.0-aws
  | 2024-11-28T07:42:11.526Z | algo-1:17773:17846 [0] NCCL INFO NET/OFI Using Libfabric version 1.22
  | 2024-11-28T07:42:11.526Z | algo-1:17773:17846 [0] NCCL INFO NET/OFI Using CUDA driver version 12040
  | 2024-11-28T07:42:11.526Z | algo-1:17773:17846 [0] NCCL INFO NET/OFI Configuring AWS-specific options
  | 2024-11-28T07:42:11.526Z | algo-1:17773:17846 [0] NCCL INFO NET/OFI Setting provider_filter to efa
  | 2024-11-28T07:42:11.526Z | algo-1:17773:17846 [0] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
  | 2024-11-28T07:42:11.526Z | algo-1:17773:17846 [0] NCCL INFO NET/OFI Setting NCCL_NVLSTREE_MAX_CHUNKSIZE to 512KiB
  | 2024-11-28T07:42:11.526Z | algo-1:17773:17846 [0] NCCL INFO NET/OFI Setting NCCL_NVLS_CHUNKSIZE to 512KiB
  | 2024-11-28T07:42:11.526Z | algo-1:17773:17846 [0] NCCL INFO NET/OFI Running on p4d.24xlarge platform, Setting NCCL_TOPO_FILE environment variable to /usr/local/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
  | 2024-11-28T07:42:11.526Z | algo-1:17773:17846 [0] NCCL INFO NET/OFI Internode latency set at 75.0 us
  | 2024-11-28T07:42:11.526Z | algo-1:17773:17846 [0] NCCL INFO NET/OFI Creating one domain per process
  | 2024-11-28T07:42:11.526Z | algo-1:17773:17846 [0] NCCL INFO NET/OFI Using transport protocol SENDRECV
  | 2024-11-28T07:42:11.526Z | algo-1:17775:17847 [2] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.11.0-aws
  | 2024-11-28T07:42:11.526Z | algo-1:17775:17847 [2] NCCL INFO NET/OFI Using Libfabric version 1.22
  | 2024-11-28T07:42:11.526Z | algo-1:17775:17847 [2] NCCL INFO NET/OFI Using CUDA driver version 12040
  | 2024-11-28T07:42:11.526Z | algo-1:17775:17847 [2] NCCL INFO NET/OFI Configuring AWS-specific options
  | 2024-11-28T07:42:11.526Z | algo-1:17775:17847 [2] NCCL INFO NET/OFI Setting provider_filter to efa
  | 2024-11-28T07:42:11.526Z | algo-1:17775:17847 [2] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
  | 2024-11-28T07:42:11.526Z | algo-1:17775:17847 [2] NCCL INFO NET/OFI Setting NCCL_NVLSTREE_MAX_CHUNKSIZE to 512KiB
  | 2024-11-28T07:42:11.526Z | algo-1:17775:17847 [2] NCCL INFO NET/OFI Setting NCCL_NVLS_CHUNKSIZE to 512KiB
  | 2024-11-28T07:42:11.526Z | algo-1:17775:17847 [2] NCCL INFO NET/OFI Running on p4d.24xlarge platform, Setting NCCL_TOPO_FILE environment variable to /usr/local/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
  | 2024-11-28T07:42:11.526Z | algo-1:17775:17847 [2] NCCL INFO NET/OFI Internode latency set at 75.0 us
  | 2024-11-28T07:42:11.526Z | algo-1:17775:17847 [2] NCCL INFO NET/OFI Creating one domain per process
  | 2024-11-28T07:42:11.526Z | algo-1:17775:17847 [2] NCCL INFO NET/OFI Using transport protocol SENDRECV
  | 2024-11-28T07:42:11.526Z | algo-1:17778:17848 [5] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.11.0-aws
  | 2024-11-28T07:42:11.526Z | algo-1:17778:17848 [5] NCCL INFO NET/OFI Using Libfabric version 1.22
  | 2024-11-28T07:42:11.526Z | algo-1:17778:17848 [5] NCCL INFO NET/OFI Using CUDA driver version 12040
  | 2024-11-28T07:42:11.526Z | algo-1:17778:17848 [5] NCCL INFO NET/OFI Configuring AWS-specific options
  | 2024-11-28T07:42:11.526Z | algo-1:17778:17848 [5] NCCL INFO NET/OFI Setting provider_filter to efa
  | 2024-11-28T07:42:11.526Z | algo-1:17778:17848 [5] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
  | 2024-11-28T07:42:11.526Z | algo-1:17778:17848 [5] NCCL INFO NET/OFI Setting NCCL_NVLSTREE_MAX_CHUNKSIZE to 512KiB
  | 2024-11-28T07:42:11.526Z | algo-1:17778:17848 [5] NCCL INFO NET/OFI Setting NCCL_NVLS_CHUNKSIZE to 512KiB
  | 2024-11-28T07:42:11.526Z | algo-1:17778:17848 [5] NCCL INFO NET/OFI Running on p4d.24xlarge platform, Setting NCCL_TOPO_FILE environment variable to /usr/local/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
  | 2024-11-28T07:42:11.526Z | algo-1:17778:17848 [5] NCCL INFO NET/OFI Internode latency set at 75.0 us
  | 2024-11-28T07:42:11.526Z | algo-1:17778:17848 [5] NCCL INFO NET/OFI Creating one domain per process
  | 2024-11-28T07:42:11.526Z | algo-1:17778:17848 [5] NCCL INFO NET/OFI Using transport protocol SENDRECV
  | 2024-11-28T07:42:11.526Z | algo-1:17777:17849 [4] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.11.0-aws
  | 2024-11-28T07:42:11.526Z | algo-1:17777:17849 [4] NCCL INFO NET/OFI Using Libfabric version 1.22
  | 2024-11-28T07:42:11.526Z | algo-1:17777:17849 [4] NCCL INFO NET/OFI Using CUDA driver version 12040
  | 2024-11-28T07:42:11.526Z | algo-1:17777:17849 [4] NCCL INFO NET/OFI Configuring AWS-specific options
  | 2024-11-28T07:42:11.526Z | algo-1:17777:17849 [4] NCCL INFO NET/OFI Setting provider_filter to efa
  | 2024-11-28T07:42:11.526Z | algo-1:17777:17849 [4] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
  | 2024-11-28T07:42:11.526Z | algo-1:17777:17849 [4] NCCL INFO NET/OFI Setting NCCL_NVLSTREE_MAX_CHUNKSIZE to 512KiB
  | 2024-11-28T07:42:11.526Z | algo-1:17777:17849 [4] NCCL INFO NET/OFI Setting NCCL_NVLS_CHUNKSIZE to 512KiB
  | 2024-11-28T07:42:11.526Z | algo-1:17777:17849 [4] NCCL INFO NET/OFI Running on p4d.24xlarge platform, Setting NCCL_TOPO_FILE environment variable to /usr/local/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
  | 2024-11-28T07:42:11.526Z | algo-1:17777:17849 [4] NCCL INFO NET/OFI Internode latency set at 75.0 us
  | 2024-11-28T07:42:11.526Z | algo-1:17777:17849 [4] NCCL INFO NET/OFI Creating one domain per process
  | 2024-11-28T07:42:11.526Z | algo-1:17777:17849 [4] NCCL INFO NET/OFI Using transport protocol SENDRECV
  | 2024-11-28T07:42:11.526Z | algo-1:17773:17846 [0] NCCL INFO NET/OFI Selected Provider is efa (found 4 nics)
  | 2024-11-28T07:42:11.526Z | algo-1:17776:17850 [3] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.11.0-aws
  | 2024-11-28T07:42:11.526Z | algo-1:17776:17850 [3] NCCL INFO NET/OFI Using Libfabric version 1.22
  | 2024-11-28T07:42:11.526Z | algo-1:17776:17850 [3] NCCL INFO NET/OFI Using CUDA driver version 12040
  | 2024-11-28T07:42:11.526Z | algo-1:17776:17850 [3] NCCL INFO NET/OFI Configuring AWS-specific options
  | 2024-11-28T07:42:11.526Z | algo-1:17776:17850 [3] NCCL INFO NET/OFI Setting provider_filter to efa
  | 2024-11-28T07:42:11.526Z | algo-1:17776:17850 [3] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
  | 2024-11-28T07:42:11.526Z | algo-1:17776:17850 [3] NCCL INFO NET/OFI Setting NCCL_NVLSTREE_MAX_CHUNKSIZE to 512KiB
  | 2024-11-28T07:42:11.526Z | algo-1:17776:17850 [3] NCCL INFO NET/OFI Setting NCCL_NVLS_CHUNKSIZE to 512KiB
  | 2024-11-28T07:42:11.527Z | algo-1:17776:17850 [3] NCCL INFO NET/OFI Running on p4d.24xlarge platform, Setting NCCL_TOPO_FILE environment variable to /usr/local/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
  | 2024-11-28T07:42:11.527Z | algo-1:17776:17850 [3] NCCL INFO NET/OFI Internode latency set at 75.0 us
  | 2024-11-28T07:42:11.527Z | algo-1:17776:17850 [3] NCCL INFO NET/OFI Creating one domain per process
  | 2024-11-28T07:42:11.527Z | algo-1:17776:17850 [3] NCCL INFO NET/OFI Using transport protocol SENDRECV
  | 2024-11-28T07:42:11.527Z | algo-1:17779:17852 [6] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.11.0-aws
  | 2024-11-28T07:42:11.527Z | algo-1:17779:17852 [6] NCCL INFO NET/OFI Using Libfabric version 1.22
  | 2024-11-28T07:42:11.527Z | algo-1:17779:17852 [6] NCCL INFO NET/OFI Using CUDA driver version 12040
  | 2024-11-28T07:42:11.527Z | algo-1:17779:17852 [6] NCCL INFO NET/OFI Configuring AWS-specific options
  | 2024-11-28T07:42:11.527Z | algo-1:17779:17852 [6] NCCL INFO NET/OFI Setting provider_filter to efa
  | 2024-11-28T07:42:11.527Z | algo-1:17779:17852 [6] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
  | 2024-11-28T07:42:11.527Z | algo-1:17779:17852 [6] NCCL INFO NET/OFI Setting NCCL_NVLSTREE_MAX_CHUNKSIZE to 512KiB
  | 2024-11-28T07:42:11.527Z | algo-1:17779:17852 [6] NCCL INFO NET/OFI Setting NCCL_NVLS_CHUNKSIZE to 512KiB
  | 2024-11-28T07:42:11.527Z | algo-1:17779:17852 [6] NCCL INFO NET/OFI Running on p4d.24xlarge platform, Setting NCCL_TOPO_FILE environment variable to /usr/local/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
  | 2024-11-28T07:42:11.527Z | algo-1:17779:17852 [6] NCCL INFO NET/OFI Internode latency set at 75.0 us
  | 2024-11-28T07:42:11.527Z | algo-1:17779:17852 [6] NCCL INFO NET/OFI Creating one domain per process
  | 2024-11-28T07:42:11.527Z | algo-1:17779:17852 [6] NCCL INFO NET/OFI Using transport protocol SENDRECV
  | 2024-11-28T07:42:11.527Z | algo-1:17780:17851 [7] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.11.0-aws
  | 2024-11-28T07:42:11.527Z | algo-1:17780:17851 [7] NCCL INFO NET/OFI Using Libfabric version 1.22
  | 2024-11-28T07:42:11.527Z | algo-1:17780:17851 [7] NCCL INFO NET/OFI Using CUDA driver version 12040
  | 2024-11-28T07:42:11.527Z | algo-1:17780:17851 [7] NCCL INFO NET/OFI Configuring AWS-specific options
  | 2024-11-28T07:42:11.527Z | algo-1:17780:17851 [7] NCCL INFO NET/OFI Setting provider_filter to efa
  | 2024-11-28T07:42:11.527Z | algo-1:17780:17851 [7] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
  | 2024-11-28T07:42:11.527Z | algo-1:17780:17851 [7] NCCL INFO NET/OFI Setting NCCL_NVLSTREE_MAX_CHUNKSIZE to 512KiB
  | 2024-11-28T07:42:11.527Z | algo-1:17780:17851 [7] NCCL INFO NET/OFI Setting NCCL_NVLS_CHUNKSIZE to 512KiB
  | 2024-11-28T07:42:11.527Z | algo-1:17780:17851 [7] NCCL INFO NET/OFI Running on p4d.24xlarge platform, Setting NCCL_TOPO_FILE environment variable to /usr/local/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
  | 2024-11-28T07:42:11.527Z | algo-1:17780:17851 [7] NCCL INFO NET/OFI Internode latency set at 75.0 us
  | 2024-11-28T07:42:11.527Z | algo-1:17780:17851 [7] NCCL INFO NET/OFI Creating one domain per process
  | 2024-11-28T07:42:11.527Z | algo-1:17780:17851 [7] NCCL INFO NET/OFI Using transport protocol SENDRECV
  | 2024-11-28T07:42:11.527Z | algo-1:17774:17853 [1] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.11.0-aws
  | 2024-11-28T07:42:11.527Z | algo-1:17774:17853 [1] NCCL INFO NET/OFI Using Libfabric version 1.22
  | 2024-11-28T07:42:11.527Z | algo-1:17774:17853 [1] NCCL INFO NET/OFI Using CUDA driver version 12040
  | 2024-11-28T07:42:11.527Z | algo-1:17774:17853 [1] NCCL INFO NET/OFI Configuring AWS-specific options
  | 2024-11-28T07:42:11.527Z | algo-1:17774:17853 [1] NCCL INFO NET/OFI Setting provider_filter to efa
  | 2024-11-28T07:42:11.527Z | algo-1:17774:17853 [1] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
  | 2024-11-28T07:42:11.527Z | algo-1:17774:17853 [1] NCCL INFO NET/OFI Setting NCCL_NVLSTREE_MAX_CHUNKSIZE to 512KiB
  | 2024-11-28T07:42:11.527Z | algo-1:17774:17853 [1] NCCL INFO NET/OFI Setting NCCL_NVLS_CHUNKSIZE to 512KiB
  | 2024-11-28T07:42:11.527Z | algo-1:17774:17853 [1] NCCL INFO NET/OFI Running on p4d.24xlarge platform, Setting NCCL_TOPO_FILE environment variable to /usr/local/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
  | 2024-11-28T07:42:11.527Z | algo-1:17774:17853 [1] NCCL INFO NET/OFI Internode latency set at 75.0 us
  | 2024-11-28T07:42:11.527Z | algo-1:17774:17853 [1] NCCL INFO NET/OFI Creating one domain per process
  | 2024-11-28T07:42:11.527Z | algo-1:17774:17853 [1] NCCL INFO NET/OFI Using transport protocol SENDRECV
  | 2024-11-28T07:42:11.527Z | algo-1:17775:17847 [2] NCCL INFO NET/OFI Selected Provider is efa (found 4 nics)
  | 2024-11-28T07:42:11.527Z | algo-1:17778:17848 [5] NCCL INFO NET/OFI Selected Provider is efa (found 4 nics)
  | 2024-11-28T07:42:11.527Z | algo-1:17777:17849 [4] NCCL INFO NET/OFI Selected Provider is efa (found 4 nics)
  | 2024-11-28T07:42:11.527Z | algo-1:17776:17850 [3] NCCL INFO NET/OFI Selected Provider is efa (found 4 nics)
  | 2024-11-28T07:42:11.527Z | algo-1:17779:17852 [6] NCCL INFO NET/OFI Selected Provider is efa (found 4 nics)
  | 2024-11-28T07:42:11.527Z | algo-1:17780:17851 [7] NCCL INFO NET/OFI Selected Provider is efa (found 4 nics)
  | 2024-11-28T07:42:11.527Z | algo-1:17774:17853 [1] NCCL INFO NET/OFI Selected Provider is efa (found 4 nics)
  | 2024-11-28T07:42:11.527Z | libfabric:17775:1732779731::efa:cq:efa_cq_ibv_cq_ex_open_with_ibv_create_cq_ex():92 Unable to create extended CQ: Cannot allocate memory
  | 2024-11-28T07:42:11.527Z | libfabric:17775:1732779731::efa:cq:efa_rdm_cq_open():688 Unable to create extended CQ: Invalid argument
  | 2024-11-28T07:42:11.527Z | algo-1:17775:17847 [2] nccl_ofi_ofiutils_init_connection:274 NCCL WARN NET/OFI Couldn't open CQ. RC: -22, ERROR: Invalid argument
  | 2024-11-28T07:42:11.527Z | algo-1:17775:17847 [2] nccl_net_ofi_create_plugin:259 NCCL WARN NET/OFI aws-ofi-nccl initialization failed
  | 2024-11-28T07:42:11.527Z | algo-1:17775:17847 [2] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
  | 2024-11-28T07:42:11.527Z | algo-1:17775:17847 [2] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
  | 2024-11-28T07:42:11.527Z | algo-1:17775:17847 [2] NCCL INFO NET/Socket : Using [0]eth0:10.0.227.37<0>
  | 2024-11-28T07:42:11.527Z | algo-1:17775:17847 [2] NCCL INFO Using non-device net plugin version 0
  | 2024-11-28T07:42:11.527Z | algo-1:17775:17847 [2] NCCL INFO Using network Socket
  | 2024-11-28T07:42:11.527Z | libfabric:17778:1732779731::efa:cq:efa_cq_ibv_cq_ex_open_with_ibv_create_cq_ex():92 Unable to create extended CQ: Cannot allocate memory
  | 2024-11-28T07:42:11.527Z | libfabric:17778:1732779731::efa:cq:efa_rdm_cq_open():688 Unable to create extended CQ: Invalid argument
  | 2024-11-28T07:42:11.527Z | algo-1:17778:17848 [5] nccl_ofi_ofiutils_init_connection:274 NCCL WARN NET/OFI Couldn't open CQ. RC: -22, ERROR: Invalid argument
  | 2024-11-28T07:42:11.527Z | algo-1:17778:17848 [5] nccl_net_ofi_create_plugin:259 NCCL WARN NET/OFI aws-ofi-nccl initialization failed
  | 2024-11-28T07:42:11.527Z | algo-1:17778:17848 [5] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
  | 2024-11-28T07:42:11.528Z | algo-1:17778:17848 [5] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
  | 2024-11-28T07:42:11.528Z | algo-1:17778:17848 [5] NCCL INFO NET/Socket : Using [0]eth0:10.0.227.37<0>
  | 2024-11-28T07:42:11.528Z | algo-1:17778:17848 [5] NCCL INFO Using non-device net plugin version 0
  | 2024-11-28T07:42:11.528Z | algo-1:17778:17848 [5] NCCL INFO Using network Socket
  | 2024-11-28T07:42:11.528Z | libfabric:17777:1732779731::efa:cq:efa_cq_ibv_cq_ex_open_with_ibv_create_cq_ex():92 Unable to create extended CQ: Cannot allocate memory
  | 2024-11-28T07:42:11.528Z | libfabric:17777:1732779731::efa:cq:efa_rdm_cq_open():688 Unable to create extended CQ: Invalid argument
  | 2024-11-28T07:42:11.528Z | algo-1:17777:17849 [4] nccl_ofi_ofiutils_init_connection:274 NCCL WARN NET/OFI Couldn't open CQ. RC: -22, ERROR: Invalid argument
  | 2024-11-28T07:42:11.528Z | algo-1:17777:17849 [4] nccl_net_ofi_create_plugin:259 NCCL WARN NET/OFI aws-ofi-nccl initialization failed
  | 2024-11-28T07:42:11.528Z | algo-1:17777:17849 [4] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
  | 2024-11-28T07:42:11.528Z | algo-1:17777:17849 [4] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
  | 2024-11-28T07:42:11.528Z | algo-1:17777:17849 [4] NCCL INFO NET/Socket : Using [0]eth0:10.0.227.37<0>
  | 2024-11-28T07:42:11.528Z | algo-1:17777:17849 [4] NCCL INFO Using non-device net plugin version 0
  | 2024-11-28T07:42:11.528Z | algo-1:17777:17849 [4] NCCL INFO Using network Socket
  | 2024-11-28T07:42:11.528Z | libfabric:17776:1732779731::efa:cq:efa_cq_ibv_cq_ex_open_with_ibv_create_cq_ex():92 Unable to create extended CQ: Cannot allocate memory
  | 2024-11-28T07:42:11.528Z | libfabric:17776:1732779731::efa:cq:efa_rdm_cq_open():688 Unable to create extended CQ: Invalid argument
  | 2024-11-28T07:42:11.528Z | algo-1:17776:17850 [3] nccl_ofi_ofiutils_init_connection:274 NCCL WARN NET/OFI Couldn't open CQ. RC: -22, ERROR: Invalid argument
  | 2024-11-28T07:42:11.528Z | algo-1:17776:17850 [3] nccl_net_ofi_create_plugin:259 NCCL WARN NET/OFI aws-ofi-nccl initialization failed
  | 2024-11-28T07:42:11.528Z | algo-1:17776:17850 [3] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
  | 2024-11-28T07:42:11.528Z | algo-1:17776:17850 [3] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
  | 2024-11-28T07:42:11.528Z | algo-1:17776:17850 [3] NCCL INFO NET/Socket : Using [0]eth0:10.0.227.37<0>
  | 2024-11-28T07:42:11.528Z | algo-1:17776:17850 [3] NCCL INFO Using non-device net plugin version 0
  | 2024-11-28T07:42:11.528Z | algo-1:17776:17850 [3] NCCL INFO Using network Socket
  | 2024-11-28T07:42:11.528Z | libfabric:17779:1732779731::efa:cq:efa_cq_ibv_cq_ex_open_with_ibv_create_cq_ex():92 Unable to create extended CQ: Cannot allocate memory
  | 2024-11-28T07:42:11.528Z | libfabric:17779:1732779731::efa:cq:efa_rdm_cq_open():688 Unable to create extended CQ: Invalid argument
  | 2024-11-28T07:42:11.528Z | algo-1:17779:17852 [6] nccl_ofi_ofiutils_init_connection:274 NCCL WARN NET/OFI Couldn't open CQ. RC: -22, ERROR: Invalid argument
  | 2024-11-28T07:42:11.528Z | algo-1:17779:17852 [6] nccl_net_ofi_create_plugin:259 NCCL WARN NET/OFI aws-ofi-nccl initialization failed
  | 2024-11-28T07:42:11.528Z | algo-1:17779:17852 [6] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
  | 2024-11-28T07:42:11.528Z | algo-1:17779:17852 [6] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
  | 2024-11-28T07:42:11.528Z | algo-1:17779:17852 [6] NCCL INFO NET/Socket : Using [0]eth0:10.0.227.37<0>
  | 2024-11-28T07:42:11.528Z | algo-1:17779:17852 [6] NCCL INFO Using non-device net plugin version 0
  | 2024-11-28T07:42:11.528Z | algo-1:17779:17852 [6] NCCL INFO Using network Socket
  | 2024-11-28T07:42:11.528Z | libfabric:17780:1732779731::efa:cq:efa_cq_ibv_cq_ex_open_with_ibv_create_cq_ex():92 Unable to create extended CQ: Cannot allocate memory
  | 2024-11-28T07:42:11.528Z | libfabric:17780:1732779731::efa:cq:efa_rdm_cq_open():688 Unable to create extended CQ: Invalid argument
  | 2024-11-28T07:42:11.528Z | algo-1:17780:17851 [7] nccl_ofi_ofiutils_init_connection:274 NCCL WARN NET/OFI Couldn't open CQ. RC: -22, ERROR: Invalid argument
  | 2024-11-28T07:42:11.528Z | algo-1:17780:17851 [7] nccl_net_ofi_create_plugin:259 NCCL WARN NET/OFI aws-ofi-nccl initialization failed
  | 2024-11-28T07:42:11.528Z | algo-1:17780:17851 [7] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
  | 2024-11-28T07:42:11.528Z | algo-1:17780:17851 [7] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
  | 2024-11-28T07:42:11.528Z | libfabric:17774:1732779731::efa:cq:efa_cq_ibv_cq_ex_open_with_ibv_create_cq_ex():92 Unable to create extended CQ: Cannot allocate memory
  | 2024-11-28T07:42:11.528Z | libfabric:17774:1732779731::efa:cq:efa_rdm_cq_open():688 Unable to create extended CQ: Invalid argument
  | 2024-11-28T07:42:11.528Z | algo-1:17774:17853 [1] nccl_ofi_ofiutils_init_connection:274 NCCL WARN NET/OFI Couldn't open CQ. RC: -22, ERROR: Invalid argument
  | 2024-11-28T07:42:11.528Z | algo-1:17774:17853 [1] nccl_net_ofi_create_plugin:259 NCCL WARN NET/OFI aws-ofi-nccl initialization failed
  | 2024-11-28T07:42:11.528Z | algo-1:17780:17851 [7] NCCL INFO NET/Socket : Using [0]eth0:10.0.227.37<0>
  | 2024-11-28T07:42:11.528Z | algo-1:17774:17853 [1] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
  | 2024-11-28T07:42:11.528Z | algo-1:17780:17851 [7] NCCL INFO Using non-device net plugin version 0
  | 2024-11-28T07:42:11.528Z | algo-1:17774:17853 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
  | 2024-11-28T07:42:11.528Z | algo-1:17780:17851 [7] NCCL INFO Using network Socket
  | 2024-11-28T07:42:11.528Z | algo-1:17774:17853 [1] NCCL INFO NET/Socket : Using [0]eth0:10.0.227.37<0>
  | 2024-11-28T07:42:11.528Z | algo-1:17774:17853 [1] NCCL INFO Using non-device net plugin version 0
  | 2024-11-28T07:42:11.528Z | algo-1:17774:17853 [1] NCCL INFO Using network Socket
  | 2024-11-28T07:42:12.529Z | libfabric:17773:1732779731::efa:ep_ctrl:efa_rdm_ep_set_shared_memory_permitted():1410 FI_OPT_SHARED_MEMORY_PERMITTED set to false
  | 2024-11-28T07:42:12.529Z | algo-1:17773:17846 [0] NCCL INFO NET/OFI Setting FI_OPT_EFA_SENDRECV_IN_ORDER_ALIGNED_128_BYTES not supported.
  | 2024-11-28T07:42:12.529Z | algo-1:17773:17846 [0] NCCL INFO NET/OFI Setting NCCL_PROTO to "simple"
  | 2024-11-28T07:42:12.529Z | algo-1:17773:17846 [0] NCCL INFO NET/OFI Global registrations supported
  | 2024-11-28T07:42:12.529Z | algo-1:17773:17846 [0] NCCL INFO Using non-device net plugin version 0
  | 2024-11-28T07:42:12.529Z | algo-1:17773:17846 [0] NCCL INFO Using network AWS Libfabric
  | 2024-11-28T07:42:12.529Z | algo-1:17773:17846 [0] NCCL INFO DMA-BUF is available on GPU device 0
  | 2024-11-28T07:42:12.529Z | algo-1:17780:17851 [7] NCCL INFO ncclCommInitRank comm 0x55e907bfdd40 rank 7 nranks 16 cudaDev 7 nvmlDev 7 busId a01d0 commId 0x66950bf773bc9d02 - Init START
  | 2024-11-28T07:42:12.529Z | algo-1:17779:17852 [6] NCCL INFO ncclCommInitRank comm 0x55e90b734130 rank 6 nranks 16 cudaDev 6 nvmlDev 6 busId a01c0 commId 0x66950bf773bc9d02 - Init START
  | 2024-11-28T07:42:12.529Z | algo-1:17777:17849 [4] NCCL INFO ncclCommInitRank comm 0x55817740ed10 rank 4 nranks 16 cudaDev 4 nvmlDev 4 busId 901c0 commId 0x66950bf773bc9d02 - Init START
  | 2024-11-28T07:42:12.529Z | algo-1:17776:17850 [3] NCCL INFO ncclCommInitRank comm 0x564b80836250 rank 3 nranks 16 cudaDev 3 nvmlDev 3 busId 201d0 commId 0x66950bf773bc9d02 - Init START
  | 2024-11-28T07:42:12.529Z | algo-1:17775:17847 [2] NCCL INFO ncclCommInitRank comm 0x555f3e5e0cf0 rank 2 nranks 16 cudaDev 2 nvmlDev 2 busId 201c0 commId 0x66950bf773bc9d02 - Init START
  | 2024-11-28T07:42:12.529Z | algo-1:17778:17848 [5] NCCL INFO ncclCommInitRank comm 0x560ba6e8e160 rank 5 nranks 16 cudaDev 5 nvmlDev 5 busId 901d0 commId 0x66950bf773bc9d02 - Init START
  | 2024-11-28T07:42:12.529Z | algo-1:17773:17846 [0] NCCL INFO ncclCommInitRank comm 0x55addb45e7f0 rank 0 nranks 16 cudaDev 0 nvmlDev 0 busId 101c0 commId 0x66950bf773bc9d02 - Init START
  | 2024-11-28T07:42:12.529Z | algo-1:17774:17853 [1] NCCL INFO ncclCommInitRank comm 0x55a777617df0 rank 1 nranks 16 cudaDev 1 nvmlDev 1 busId 101d0 commId 0x66950bf773bc9d02 - Init START
  | 2024-11-28T07:42:12.529Z | algo-1:17780:17851 [7] NCCL INFO NCCL_TOPO_FILE set by environment to /usr/local/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
  | 2024-11-28T07:42:12.529Z | algo-1:17779:17852 [6] NCCL INFO NCCL_TOPO_FILE set by environment to /usr/local/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
  | 2024-11-28T07:42:12.529Z | algo-1:17777:17849 [4] NCCL INFO NCCL_TOPO_FILE set by environment to /usr/local/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
  | 2024-11-28T07:42:12.529Z | algo-1:17778:17848 [5] NCCL INFO NCCL_TOPO_FILE set by environment to /usr/local/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
  | 2024-11-28T07:42:12.529Z | algo-1:17775:17847 [2] NCCL INFO NCCL_TOPO_FILE set by environment to /usr/local/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
  | 2024-11-28T07:42:12.529Z | algo-1:17776:17850 [3] NCCL INFO NCCL_TOPO_FILE set by environment to /usr/local/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
  | 2024-11-28T07:42:12.529Z | algo-1:17774:17853 [1] NCCL INFO NCCL_TOPO_FILE set by environment to /usr/local/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
  | 2024-11-28T07:42:12.529Z | algo-1:17773:17846 [0] NCCL INFO NCCL_TOPO_FILE set by environment to /usr/local/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
  | 2024-11-28T07:42:13.529Z | algo-1:17777:17849 [4] NCCL INFO Setting affinity for GPU 4 to ffffff00,0000ffff,ff000000
  | 2024-11-28T07:42:13.529Z | algo-1:17777:17849 [4] NCCL INFO NVLS multicast support is not available on dev 4
  | 2024-11-28T07:42:13.529Z | algo-1:17776:17850 [3] NCCL INFO Setting affinity for GPU 3 to ff,ffff0000,00ffffff
  | 2024-11-28T07:42:13.530Z | algo-1:17776:17850 [3] NCCL INFO NVLS multicast support is not available on dev 3
  | 2024-11-28T07:42:13.530Z | algo-1:17773:17846 [0] NCCL INFO NET/OFI Global registrations supported
  | 2024-11-28T07:42:13.530Z | algo-1:17773:17846 [0] NCCL INFO NET/OFI Global registrations supported
  | 2024-11-28T07:42:13.530Z | algo-1:17773:17846 [0] NCCL INFO NET/OFI Global registrations supported
  | 2024-11-28T07:42:13.530Z | algo-1:17773:17846 [0] NCCL INFO NET/OFI Global registrations supported
  | 2024-11-28T07:42:13.530Z | algo-1:17774:17853 [1] NCCL INFO Setting affinity for GPU 1 to ff,ffff0000,00ffffff
  | 2024-11-28T07:42:13.530Z | algo-1:17774:17853 [1] NCCL INFO NVLS multicast support is not available on dev 1
  | 2024-11-28T07:42:13.530Z | algo-1:17778:17848 [5] NCCL INFO Setting affinity for GPU 5 to ffffff00,0000ffff,ff000000
  | 2024-11-28T07:42:13.530Z | algo-1:17778:17848 [5] NCCL INFO NVLS multicast support is not available on dev 5
  | 2024-11-28T07:42:13.530Z | algo-1:17775:17847 [2] NCCL INFO Setting affinity for GPU 2 to ff,ffff0000,00ffffff
  | 2024-11-28T07:42:13.530Z | algo-1:17775:17847 [2] NCCL INFO NVLS multicast support is not available on dev 2
  | 2024-11-28T07:42:13.530Z | algo-1:17780:17851 [7] NCCL INFO Setting affinity for GPU 7 to ffffff00,0000ffff,ff000000
  | 2024-11-28T07:42:13.530Z | algo-1:17780:17851 [7] NCCL INFO NVLS multicast support is not available on dev 7
  | 2024-11-28T07:42:13.530Z | algo-1:17773:17846 [0] NCCL INFO Setting affinity for GPU 0 to ff,ffff0000,00ffffff
  | 2024-11-28T07:42:13.530Z | algo-1:17773:17846 [0] NCCL INFO NVLS multicast support is not available on dev 0
  | 2024-11-28T07:42:13.530Z | algo-1:17779:17852 [6] NCCL INFO Setting affinity for GPU 6 to ffffff00,0000ffff,ff000000
  | 2024-11-28T07:42:13.530Z | algo-1:17779:17852 [6] NCCL INFO NVLS multicast support is not available on dev 6
  | 2024-11-28T07:42:13.530Z | algo-1:17780:17851 [7] NCCL INFO comm 0x55e907bfdd40 rank 7 nRanks 16 nNodes 2 localRanks 8 localRank 7 MNNVL 0
  | 2024-11-28T07:42:13.530Z | algo-1:17779:17852 [6] NCCL INFO comm 0x55e90b734130 rank 6 nRanks 16 nNodes 2 localRanks 8 localRank 6 MNNVL 0
  | 2024-11-28T07:42:13.530Z | algo-1:17778:17848 [5] NCCL INFO comm 0x560ba6e8e160 rank 5 nRanks 16 nNodes 2 localRanks 8 localRank 5 MNNVL 0
  | 2024-11-28T07:42:13.530Z | algo-1:17773:17846 [0] NCCL INFO comm 0x55addb45e7f0 rank 0 nRanks 16 nNodes 2 localRanks 8 localRank 0 MNNVL 0
  | 2024-11-28T07:42:13.530Z | algo-1:17775:17847 [2] NCCL INFO comm 0x555f3e5e0cf0 rank 2 nRanks 16 nNodes 2 localRanks 8 localRank 2 MNNVL 0
  | 2024-11-28T07:42:13.530Z | algo-1:17776:17850 [3] NCCL INFO comm 0x564b80836250 rank 3 nRanks 16 nNodes 2 localRanks 8 localRank 3 MNNVL 0
  | 2024-11-28T07:42:13.530Z | algo-1:17777:17849 [4] NCCL INFO comm 0x55817740ed10 rank 4 nRanks 16 nNodes 2 localRanks 8 localRank 4 MNNVL 0
  | 2024-11-28T07:42:13.530Z | algo-1:17780:17851 [7] NCCL INFO Trees [0] -1/-1/-1->7->6 [1] -1/-1/-1->7->6
  | 2024-11-28T07:42:13.530Z | algo-1:17779:17852 [6] NCCL INFO Trees [0] 7/-1/-1->6->5 [1] 7/-1/-1->6->5
  | 2024-11-28T07:42:13.530Z | algo-1:17778:17848 [5] NCCL INFO Trees [0] 6/-1/-1->5->4 [1] 6/-1/-1->5->4
  | 2024-11-28T07:42:13.530Z | algo-1:17780:17851 [7] NCCL INFO P2P Chunksize set to 131072
  | 2024-11-28T07:42:13.530Z | algo-1:17779:17852 [6] NCCL INFO P2P Chunksize set to 131072
  | 2024-11-28T07:42:13.531Z | algo-1:17774:17853 [1] NCCL INFO comm 0x55a777617df0 rank 1 nRanks 16 nNodes 2 localRanks 8 localRank 1 MNNVL 0
  | 2024-11-28T07:42:13.531Z | algo-1:17778:17848 [5] NCCL INFO P2P Chunksize set to 131072
  | 2024-11-28T07:42:13.531Z | algo-1:17773:17846 [0] NCCL INFO Channel 00/02 : 0 7 6 5 4 3 2 1 8 15 14 13 12 11 10 9
  | 2024-11-28T07:42:13.531Z | algo-1:17775:17847 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1
  | 2024-11-28T07:42:13.531Z | algo-1:17776:17850 [3] NCCL INFO Trees [0] 4/-1/-1->3->2 [1] 4/-1/-1->3->2
  | 2024-11-28T07:42:13.531Z | algo-1:17775:17847 [2] NCCL INFO P2P Chunksize set to 131072
  | 2024-11-28T07:42:13.531Z | algo-1:17773:17846 [0] NCCL INFO Channel 01/02 : 0 7 6 5 4 3 2 1 8 15 14 13 12 11 10 9
  | 2024-11-28T07:42:13.531Z | algo-1:17776:17850 [3] NCCL INFO P2P Chunksize set to 131072
  | 2024-11-28T07:42:13.531Z | algo-1:17777:17849 [4] NCCL INFO Trees [0] 5/-1/-1->4->3 [1] 5/-1/-1->4->3
  | 2024-11-28T07:42:13.531Z | algo-1:17773:17846 [0] NCCL INFO Trees [0] 1/8/-1->0->-1 [1] 1/-1/-1->0->8
  | 2024-11-28T07:42:13.531Z | algo-1:17777:17849 [4] NCCL INFO P2P Chunksize set to 131072
  | 2024-11-28T07:42:13.531Z | algo-1:17773:17846 [0] NCCL INFO P2P Chunksize set to 131072
  | 2024-11-28T07:42:13.531Z | algo-1:17774:17853 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0
  | 2024-11-28T07:42:13.531Z | algo-1:17774:17853 [1] NCCL INFO P2P Chunksize set to 131072
  | 2024-11-28T07:42:13.531Z | algo-1:17775:17847 [2] NCCL INFO Channel 00/0 : 2[2] -> 1[1] via P2P/CUMEM/read
  | 2024-11-28T07:42:13.531Z | algo-1:17778:17848 [5] NCCL INFO Channel 00/0 : 5[5] -> 4[4] via P2P/CUMEM/read
  | 2024-11-28T07:42:13.531Z | algo-1:17776:17850 [3] NCCL INFO Channel 00/0 : 3[3] -> 2[2] via P2P/CUMEM/read
  | 2024-11-28T07:42:13.531Z | algo-1:17779:17852 [6] NCCL INFO Channel 00/0 : 6[6] -> 5[5] via P2P/CUMEM/read
  | 2024-11-28T07:42:13.531Z | algo-1:17777:17849 [4] NCCL INFO Channel 00/0 : 4[4] -> 3[3] via P2P/CUMEM/read
  | 2024-11-28T07:42:13.531Z | algo-1:17775:17847 [2] NCCL INFO Channel 01/0 : 2[2] -> 1[1] via P2P/CUMEM/read
  | 2024-11-28T07:42:13.531Z | algo-1:17776:17850 [3] NCCL INFO Channel 01/0 : 3[3] -> 2[2] via P2P/CUMEM/read
  | 2024-11-28T07:42:13.531Z | algo-1:17778:17848 [5] NCCL INFO Channel 01/0 : 5[5] -> 4[4] via P2P/CUMEM/read
  | 2024-11-28T07:42:13.531Z | algo-1:17779:17852 [6] NCCL INFO Channel 01/0 : 6[6] -> 5[5] via P2P/CUMEM/read
  | 2024-11-28T07:42:13.531Z | algo-1:17777:17849 [4] NCCL INFO Channel 01/0 : 4[4] -> 3[3] via P2P/CUMEM/read
  | 2024-11-28T07:42:13.531Z | algo-1:17773:17857 [0] NCCL INFO NET/OFI Global registrations supported
  | 2024-11-28T07:42:13.531Z | algo-1:17774:17853 [1] NCCL INFO Channel 00/0 : 1[1] -> 8[0] [send] via NET/Socket/0
  | 2024-11-28T07:42:13.531Z | algo-1:17774:17853 [1] NCCL INFO Channel 01/0 : 1[1] -> 8[0] [send] via NET/Socket/0
  | 2024-11-28T07:42:13.531Z | libfabric:17773:1732779733::efa:ep_ctrl:efa_rdm_ep_set_shared_memory_permitted():1410 FI_OPT_SHARED_MEMORY_PERMITTED set to false
  | 2024-11-28T07:42:14.532Z | algo-1:17773:17846 [0] NCCL INFO Channel 00/0 : 9[1] -> 0[0] [receive] via NET/AWS Libfabric/0/GDRDMA
  | 2024-11-28T07:42:14.532Z | algo-1:17773:17857 [0] NCCL INFO NET/OFI Global registrations supported
  | 2024-11-28T07:42:14.532Z | algo-1:17773:17846 [0] NCCL INFO Channel 01/0 : 9[1] -> 0[0] [receive] via NET/AWS Libfabric/0/GDRDMA
  | 2024-11-28T07:42:14.532Z | algo-1:17773:17846 [0] NCCL INFO Channel 00/0 : 0[0] -> 7[7] via P2P/CUMEM/read
  | 2024-11-28T07:42:14.532Z | algo-1:17773:17846 [0] NCCL INFO Channel 01/0 : 0[0] -> 7[7] via P2P/CUMEM/read
  | 2024-11-28T07:42:14.532Z | algo-1:17780:17851 [7] NCCL INFO Channel 00/0 : 7[7] -> 6[6] via P2P/CUMEM/read
  | 2024-11-28T07:42:14.532Z | algo-1:17780:17851 [7] NCCL INFO Channel 01/0 : 7[7] -> 6[6] via P2P/CUMEM/read
  | 2024-11-28T07:42:14.532Z | algo-1:17777:17849 [4] NCCL INFO Connected all rings
  | 2024-11-28T07:42:14.532Z | algo-1:17774:17862 [1] misc/socket.cc:717 NCCL WARN ncclSocketInit: connecting to address with family 33022 is neither AF_INET(2) nor AF_INET6(10)
  | 2024-11-28T07:42:14.532Z | algo-1:17774:17862 [1] NCCL INFO transport/net_socket.cc:331 -> 3
  | 2024-11-28T07:42:14.532Z | algo-1:17774:17862 [1] NCCL INFO transport/net.cc:687 -> 3
  | 2024-11-28T07:42:14.532Z | algo-1:17774:17862 [1] misc/socket.cc:717 NCCL WARN ncclSocketInit: connecting to address with family 33022 is neither AF_INET(2) nor AF_INET6(10)
  | 2024-11-28T07:42:14.532Z | algo-1:17774:17862 [1] NCCL INFO transport/net_socket.cc:331 -> 3
  | 2024-11-28T07:42:14.532Z | algo-1:17774:17862 [1] NCCL INFO transport/net.cc:687 -> 3
  | 2024-11-28T07:42:14.532Z | algo-1:17777:17849 [4] NCCL INFO Channel 00/0 : 4[4] -> 5[5] via P2P/CUMEM/read
  | 2024-11-28T07:42:14.532Z | algo-1:17777:17849 [4] NCCL INFO Channel 01/0 : 4[4] -> 5[5] via P2P/CUMEM/read
  | 2024-11-28T07:42:14.532Z | algo-1:17774:17853 [1] NCCL INFO transport/net.cc:306 -> 3
  | 2024-11-28T07:42:14.532Z | algo-1:17774:17853 [1] NCCL INFO transport.cc:165 -> 3
  | 2024-11-28T07:42:14.532Z | algo-1:17774:17853 [1] NCCL INFO init.cc:1263 -> 3
  | 2024-11-28T07:42:14.532Z | algo-1:17774:17853 [1] NCCL INFO init.cc:1548 -> 3
  | 2024-11-28T07:42:14.532Z | algo-1:17774:17853 [1] NCCL INFO group.cc:64 -> 3 [Async thread]
  | 2024-11-28T07:42:14.532Z | algo-1:17774:17774 [1] NCCL INFO group.cc:418 -> 3
  | 2024-11-28T07:42:14.532Z | algo-1:17774:17774 [1] NCCL INFO init.cc:1929 -> 3
  | 2024-11-28T07:42:14.532Z | [rank1]: Traceback (most recent call last):
  | 2024-11-28T07:42:14.532Z | [rank1]: File "/opt/ml/code/dlrm_main.py", line 952, in
  | 2024-11-28T07:42:14.532Z | [rank1]: invoke_main()
  | 2024-11-28T07:42:14.532Z | [rank1]: File "/opt/ml/code/dlrm_main.py", line 949, in invoke_main
  | 2024-11-28T07:42:14.532Z | [rank1]: main(sys.argv[1:])
  | 2024-11-28T07:42:14.532Z | [rank1]: File "/opt/ml/code/dlrm_main.py", line 806, in main
  | 2024-11-28T07:42:14.533Z | [rank1]: torch.distributed.barrier()
  | 2024-11-28T07:42:14.533Z | [rank1]: File "/opt/conda/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
  | 2024-11-28T07:42:14.533Z | [rank1]: return func(*args, **kwargs)
  | 2024-11-28T07:42:14.533Z | [rank1]: ^^^^^^^^^^^^^^^^^^^^^
  | 2024-11-28T07:42:14.533Z | [rank1]: File "/opt/conda/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 4159, in barrier
  | 2024-11-28T07:42:14.533Z | [rank1]: work = group.barrier(opts=opts)
  | 2024-11-28T07:42:14.533Z | [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^
  | 2024-11-28T07:42:14.533Z | [rank1]: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/NCCLUtils.hpp:317, internal error - please report this issue to the NCCL developers, NCCL version 2.21.5
  | 2024-11-28T07:42:14.533Z | [rank1]: ncclInternalError: Internal check failed.
  | 2024-11-28T07:42:14.533Z | [rank1]: Last error:
  | 2024-11-28T07:42:14.533Z | [rank1]: ncclSocketInit: connecting to address with family 33022 is neither AF_INET(2) nor AF_INET6(10)
  | 2024-11-28T07:42:14.533Z | algo-1:17778:17848 [5] NCCL INFO Connected all rings
  | 2024-11-28T07:42:14.533Z | algo-1:17778:17848 [5] NCCL INFO Channel 00/0 : 5[5] -> 6[6] via P2P/CUMEM/read
  | 2024-11-28T07:42:14.533Z | algo-1:17778:17848 [5] NCCL INFO Channel 01/0 : 5[5] -> 6[6] via P2P/CUMEM/read
  | 2024-11-28T07:42:14.533Z | algo-1:17780:17851 [7] NCCL INFO Connected all rings
  | 2024-11-28T07:42:14.533Z | algo-1:17779:17852 [6] NCCL INFO Connected all rings
  | 2024-11-28T07:42:14.533Z | algo-1:17773:17846 [0] NCCL INFO Connected all rings
  | 2024-11-28T07:42:14.533Z | algo-1:17773:17846 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/CUMEM/read
  | 2024-11-28T07:42:14.533Z | algo-1:17773:17846 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/CUMEM/read
  | 2024-11-28T07:42:14.533Z | algo-1:17779:17852 [6] NCCL INFO Channel 00/0 : 6[6] -> 7[7] via P2P/CUMEM/read
  | 2024-11-28T07:42:14.533Z | algo-1:17779:17852 [6] NCCL INFO Channel 01/0 : 6[6] -> 7[7] via P2P/CUMEM/read
  | 2024-11-28T07:42:14.533Z | algo-1:17780:17851 [7] NCCL INFO Connected all trees
  | 2024-11-28T07:42:14.533Z | algo-1:17780:17851 [7] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
  | 2024-11-28T07:42:14.533Z | algo-1:17780:17851 [7] NCCL INFO 2 coll channels, 2 collnet channels, 0 nvls channels, 2 p2p channels, 1 p2p channels per peer
  | 2024-11-28T07:42:14.533Z | algo-1:17779:17852 [6] NCCL INFO Connected all trees
  | 2024-11-28T07:42:14.533Z | algo-1:17779:17852 [6] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
  | 2024-11-28T07:42:14.533Z | algo-1:17779:17852 [6] NCCL INFO 2 coll channels, 2 collnet channels, 0 nvls channels, 2 p2p channels, 1 p2p channels per peer
  | 2024-11-28T07:42:15.533Z | W1128 07:42:14.608000 17676 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 17773 closing signal SIGTERM
  | 2024-11-28T07:42:15.533Z | W1128 07:42:14.609000 17676 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 17775 closing signal SIGTERM
  | 2024-11-28T07:42:15.533Z | W1128 07:42:14.609000 17676 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 17776 closing signal SIGTERM
  | 2024-11-28T07:42:15.533Z | W1128 07:42:14.609000 17676 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 17777 closing signal SIGTERM
  | 2024-11-28T07:42:15.533Z | W1128 07:42:14.609000 17676 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 17778 closing signal SIGTERM
  | 2024-11-28T07:42:15.533Z | W1128 07:42:14.610000 17676 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 17779 closing signal SIGTERM
  | 2024-11-28T07:42:15.533Z | W1128 07:42:14.610000 17676 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 17780 closing signal SIGTERM
  | 2024-11-28T07:42:16.534Z | E1128 07:42:15.588000 17676 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 1 (pid: 17774) of binary: /opt/conda/bin/python3.11
  | 2024-11-28T07:42:16.534Z | Traceback (most recent call last): File "/opt/conda/bin/torchrun", line 8, in
  | 2024-11-28T07:42:16.534Z | sys.exit(main())
  | 2024-11-28T07:42:16.534Z | ^^
  | 2024-11-28T07:42:16.534Z | ^^^^ File "/opt/conda/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 355, in wrapper
  | 2024-11-28T07:42:16.534Z | return f(*args, **kwargs)
  | 2024-11-28T07:42:16.534Z | ^^
  | 2024-11-28T07:42:16.534Z | ^^^^^^^^
  | 2024-11-28T07:42:16.534Z | ^^^^^^^^ File "/opt/conda/lib/python3.11/site-packages/torch/distributed/run.py", line 919, in main
  | 2024-11-28T07:42:16.534Z | run(args)
  | 2024-11-28T07:42:16.534Z | File "/opt/conda/lib/python3.11/site-packages/torch/distributed/run.py", line 910, in run
  | 2024-11-28T07:42:16.534Z | elastic_launch(
  | 2024-11-28T07:42:16.534Z | File "/opt/conda/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 138, in call
  | 2024-11-28T07:42:16.534Z | return launch_agent(self._config, self._entrypoint, list(args))
  | 2024-11-28T07:42:16.534Z | ^^^^^^^
  | 2024-11-28T07:42:16.534Z | ^^^^^^^^^^^^^^^^^^^^
  | 2024-11-28T07:42:16.534Z | ^^^^^^^^^^^^^^^^^^^^^
  | 2024-11-28T07:42:16.534Z | ^^^^^^^^ File "/opt/conda/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
  | 2024-11-28T07:42:16.534Z | raise ChildFailedError(
  | 2024-11-28T07:42:16.534Z | torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
  | 2024-11-28T07:42:16.534Z | ============================================================
  | 2024-11-28T07:42:16.534Z | /opt/ml/code/dlrm_main.py FAILED
  | 2024-11-28T07:42:16.534Z | ------------------------------------------------------------
  | 2024-11-28T07:42:16.534Z | Failures: <NO_OTHER_FAILURES>
  | 2024-11-28T07:42:16.534Z | ------------------------------------------------------------
  | 2024-11-28T07:42:16.534Z | Root Cause (first observed failure):
  | 2024-11-28T07:42:16.534Z | [0]: time : 2024-11-28_07:42:14 host : algo-1 rank : 1 (local_rank: 1) exitcode : 1 (pid: 17774) error_file: <N/A> traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
  | 2024-11-28T07:42:16.534Z | ============================================================
  | 2024-11-28T07:42:16.534Z | 2024-11-28 07:42:15,871 sagemaker-training-toolkit INFO Waiting for the process to finish and give a return code.
  | 2024-11-28T07:42:16.534Z | 2024-11-28 07:42:15,871 sagemaker-training-toolkit INFO Done waiting for a return code. Received 0 from exiting process.
  | 2024-11-28T07:42:16.535Z | 2024-11-28 07:42:15,871 sagemaker-training-toolkit INFO Reporting training SUCCESS

@AvivBenchorin
Contributor

Thanks for providing the logs with NCCL_DEBUG=Info and FI_LOG_LEVEL=Warn. Based on those, I suspect the root cause is related to the Cannot allocate memory error produced by the EFA Libfabric provider:

...
  | 2024-11-28T07:42:11.527Z | libfabric:17775:1732779731::efa:cq:efa_cq_ibv_cq_ex_open_with_ibv_create_cq_ex():92 Unable to create extended CQ: Cannot allocate memory
  | 2024-11-28T07:42:11.527Z | libfabric:17775:1732779731::efa:cq:efa_rdm_cq_open():688 Unable to create extended CQ: Invalid argument
  | 2024-11-28T07:42:11.527Z | algo-1:17775:17847 [2] nccl_ofi_ofiutils_init_connection:274 NCCL WARN NET/OFI Couldn't open CQ. RC: -22, ERROR: Invalid argument
  | 2024-11-28T07:42:11.527Z | algo-1:17775:17847 [2] nccl_net_ofi_create_plugin:259 NCCL WARN NET/OFI aws-ofi-nccl initialization failed
...
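
When reading a log like this one, it may help to check which transport each GPU process ended up selecting. In the log above, GPU 0 reports `Using network AWS Libfabric` while several other GPUs on the same host report `Using network Socket` after the plugin initialization failed. The sketch below groups the `Using network ...` lines by transport. The function names and the regular expression are illustrative assumptions based only on the line format visible in this log, not part of NCCL or aws-ofi-nccl, and whether a mixed result explains the later failure is a hypothesis rather than a confirmed diagnosis.

```python
import re
from collections import defaultdict

# Matches NCCL INFO lines of the form seen in this log, e.g.
#   algo-1:17776:17850 [3] NCCL INFO Using network Socket
# Assumption: the "host:pid:tid [gpu]" prefix is formatted as in the excerpt.
NETWORK_LINE = re.compile(
    r"(\S+):(\d+):\d+ \[(\d+)\] NCCL INFO Using network (.+?)\s*$"
)


def transports_by_gpu(log_lines):
    """Map each (host, local GPU index) to the network NCCL reported using."""
    selected = {}
    for line in log_lines:
        match = NETWORK_LINE.search(line)
        if match:
            host, _pid, gpu, network = match.groups()
            selected[(host, int(gpu))] = network
    return selected


def group_by_transport(selected):
    """Invert the mapping: transport name -> sorted list of (host, gpu)."""
    grouped = defaultdict(list)
    for (host, gpu), network in sorted(selected.items()):
        grouped[network].append((host, gpu))
    return dict(grouped)


if __name__ == "__main__":
    # Two lines copied from the log above.
    sample = [
        "  | 2024-11-28T07:42:11.528Z | algo-1:17777:17849 [4] NCCL INFO Using network Socket",
        "  | 2024-11-28T07:42:12.529Z | algo-1:17773:17846 [0] NCCL INFO Using network AWS Libfabric",
    ]
    for network, gpus in group_by_transport(transports_by_gpu(sample)).items():
        print(network, gpus)
```

If more than one transport shows up for the same job, that is worth mentioning when reporting the issue.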

Could you share the ulimit configuration (ulimit -a) from your test environment? One suspected cause of the Cannot allocate memory error is specifically the locked memory limit (ulimit -l).
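
Because the limit that matters is the one seen by the training process itself, one option is to read it from inside Python, for example in each rank before `init_process_group`. The following is a minimal sketch using the standard-library `resource` module (Unix only). The helper names `locked_memory_limits` and `format_limit` are hypothetical. Note that `getrlimit` reports bytes, whereas `ulimit -l` reports kbytes.

```python
import resource


def format_limit(value):
    """Render an rlimit value, using 'unlimited' for RLIM_INFINITY."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def locked_memory_limits():
    """Return the (soft, hard) locked-memory limits of this process, in bytes."""
    return resource.getrlimit(resource.RLIMIT_MEMLOCK)


if __name__ == "__main__":
    soft, hard = locked_memory_limits()
    print(
        f"RLIMIT_MEMLOCK soft={format_limit(soft)} "
        f"hard={format_limit(hard)} (bytes)"
    )
```

Comparing the printed values between a run that succeeds and a run that fails would show whether the locked memory limit differs between the two.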

@Chevolier
Author

Hi @AvivBenchorin, thanks for the answer. I added ulimit -a to the code and ran my training twice using 2 ml.p4d.24xlarge instances in SageMaker training jobs; one run succeeded and one failed. The logs are as follows:

  1. The good case:
    2024-12-04T05:26:25.055Z
    real-time non-blocking time (microseconds, -R) unlimited
    2024-12-04T05:26:25.055Z
    core file size (blocks, -c) unlimited
    2024-12-04T05:26:25.055Z
    data seg size (kbytes, -d) unlimited
    2024-12-04T05:26:25.055Z
    scheduling priority (-e) 0
    2024-12-04T05:26:25.055Z
    file size (blocks, -f) unlimited
    2024-12-04T05:26:25.055Z
    pending signals (-i) 30446
    2024-12-04T05:26:25.055Z
    max locked memory (kbytes, -l) unlimited
    2024-12-04T05:26:25.055Z
    max memory size (kbytes, -m) unlimited
    2024-12-04T05:26:25.055Z
    open files (-n) 65536
    2024-12-04T05:26:25.055Z
    pipe size (512 bytes, -p) 8
    2024-12-04T05:26:25.055Z
    POSIX message queues (bytes, -q) 819200
    2024-12-04T05:26:25.055Z
    real-time priority (-r) 0
    2024-12-04T05:26:25.055Z
    stack size (kbytes, -s) 65536
    2024-12-04T05:26:25.055Z
    cpu time (seconds, -t) unlimited
    2024-12-04T05:26:25.055Z
    max user processes (-u) unlimited
    2024-12-04T05:26:25.055Z
    virtual memory (kbytes, -v) unlimited
    2024-12-04T05:26:25.055Z
    file locks (-x) unlimited
    2024-12-04T05:26:27.056Z
    W1204 05:26:26.878000 17959 site-packages/torch/distributed/run.py:793]
    2024-12-04T05:26:27.056Z
    W1204 05:26:26.878000 17959 site-packages/torch/distributed/run.py:793] *****************************************
    2024-12-04T05:26:27.056Z
    W1204 05:26:26.878000 17959 site-packages/torch/distributed/run.py:793] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
    2024-12-04T05:26:27.056Z
    W1204 05:26:26.878000 17959 site-packages/torch/distributed/run.py:793] *****************************************
    2024-12-04T05:27:38.071Z
    PARAMS: (lr, batch_size, warmup_steps, decay_start, decay_steps): (0.001, 8192, 0, 0, 0)
    2024-12-04T05:27:39.071Z
    number of train_paths: 5760, valid_paths: 240, test_paths: 240
    2024-12-04T05:27:39.071Z
    rank: 1, train_dataloader: 2820, test_dataloader: 165
    2024-12-04T05:27:39.071Z
    [rank1]:[W1204 05:27:38.066254395 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 1] using GPU 1 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
    2024-12-04T05:27:39.071Z
    number of train_paths: 5760, valid_paths: 240, test_paths: 240
    2024-12-04T05:27:39.071Z
    rank: 0, train_dataloader: 2820, test_dataloader: 165
    2024-12-04T05:27:39.071Z
    number of train_paths: 5760, valid_paths: 240, test_paths: 240
    2024-12-04T05:27:39.071Z
    rank: 5, train_dataloader: 2820, test_dataloader: 165
    2024-12-04T05:27:39.071Z
    [rank5]:[W1204 05:27:38.386392576 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 5] using GPU 5 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
    2024-12-04T05:27:39.071Z
    number of train_paths: 5760, valid_paths: 240, test_paths: 240
    2024-12-04T05:27:39.072Z
    rank: 3, train_dataloader: 2820, test_dataloader: 165
    2024-12-04T05:27:39.072Z
    [rank3]:[W1204 05:27:38.399143903 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 3] using GPU 3 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
    2024-12-04T05:27:39.072Z
    number of train_paths: 5760, valid_paths: 240, test_paths: 240
    2024-12-04T05:27:39.072Z
    rank: 4, train_dataloader: 2820, test_dataloader: 165
    2024-12-04T05:27:39.072Z
    [rank4]:[W1204 05:27:38.436923737 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 4] using GPU 4 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
    2024-12-04T05:27:39.072Z
    number of train_paths: 5760, valid_paths: 240, test_paths: 240
    2024-12-04T05:27:39.072Z
    rank: 6, train_dataloader: 2820, test_dataloader: 165
    2024-12-04T05:27:39.072Z
    number of train_paths: 5760, valid_paths: 240, test_paths: 240
    2024-12-04T05:27:39.072Z
    rank: 7, train_dataloader: 2820, test_dataloader: 165
    2024-12-04T05:27:39.072Z
    number of train_paths: 5760, valid_paths: 240, test_paths: 240
    2024-12-04T05:27:39.072Z
    rank: 2, train_dataloader: 2820, test_dataloader: 165
    2024-12-04T05:27:39.072Z
    [rank6]:[W1204 05:27:38.469258416 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 6] using GPU 6 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
    2024-12-04T05:27:39.072Z
    [rank7]:[W1204 05:27:38.469387648 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 7] using GPU 7 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
    2024-12-04T05:27:39.072Z
    [rank2]:[W1204 05:27:38.469613975 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 2] using GPU 2 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
    2024-12-04T05:27:39.072Z
    [rank0]:[W1204 05:27:38.472578167 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 0] using GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
    2024-12-04T05:27:39.072Z
    algo-1:18057:18057 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
    2024-12-04T05:27:39.072Z
    algo-1:18057:18057 [0] NCCL INFO Bootstrap : Using eth0:10.0.206.83<0>
    2024-12-04T05:27:39.072Z
    algo-1:18057:18057 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v8 symbol.
    2024-12-04T05:27:39.072Z
    algo-1:18057:18057 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
    2024-12-04T05:27:39.072Z
    algo-1:18057:18057 [0] NCCL INFO cudaDriverVersion 12040
    2024-12-04T05:27:39.072Z
    NCCL version 2.21.5+cuda12.4
    2024-12-04T05:27:39.072Z
    algo-1:18062:18062 [5] NCCL INFO cudaDriverVersion 12040
    2024-12-04T05:27:39.072Z
    algo-1:18062:18062 [5] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
    2024-12-04T05:27:39.072Z
    algo-1:18062:18062 [5] NCCL INFO Bootstrap : Using eth0:10.0.206.83<0>
    2024-12-04T05:27:39.072Z
    algo-1:18058:18058 [1] NCCL INFO cudaDriverVersion 12040
    2024-12-04T05:27:39.072Z
    algo-1:18058:18058 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
    2024-12-04T05:27:39.072Z
    algo-1:18058:18058 [1] NCCL INFO Bootstrap : Using eth0:10.0.206.83<0>
    2024-12-04T05:27:39.072Z
    algo-1:18062:18062 [5] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v8 symbol.
    2024-12-04T05:27:39.072Z
    algo-1:18062:18062 [5] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
    2024-12-04T05:27:39.072Z
    algo-1:18058:18058 [1] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v8 symbol.
    2024-12-04T05:27:39.072Z
    algo-1:18058:18058 [1] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
    2024-12-04T05:27:39.072Z
    algo-1:18063:18063 [6] NCCL INFO cudaDriverVersion 12040
    2024-12-04T05:27:39.072Z
    algo-1:18063:18063 [6] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
    2024-12-04T05:27:39.072Z
    algo-1:18063:18063 [6] NCCL INFO Bootstrap : Using eth0:10.0.206.83<0>
    2024-12-04T05:27:39.072Z
    algo-1:18064:18064 [7] NCCL INFO cudaDriverVersion 12040
    2024-12-04T05:27:39.072Z
    algo-1:18064:18064 [7] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
    2024-12-04T05:27:39.072Z
    algo-1:18064:18064 [7] NCCL INFO Bootstrap : Using eth0:10.0.206.83<0>
    2024-12-04T05:27:39.072Z
    algo-1:18063:18063 [6] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v8 symbol.
    2024-12-04T05:27:39.072Z
    algo-1:18063:18063 [6] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
    2024-12-04T05:27:39.072Z
    algo-1:18064:18064 [7] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v8 symbol.
    2024-12-04T05:27:39.072Z
    algo-1:18064:18064 [7] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
    2024-12-04T05:27:39.072Z
    algo-1:18060:18060 [3] NCCL INFO cudaDriverVersion 12040
    2024-12-04T05:27:39.072Z
    algo-1:18060:18060 [3] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
    2024-12-04T05:27:39.072Z
    algo-1:18060:18060 [3] NCCL INFO Bootstrap : Using eth0:10.0.206.83<0>
    2024-12-04T05:27:39.072Z
    algo-1:18059:18059 [2] NCCL INFO cudaDriverVersion 12040
    2024-12-04T05:27:39.072Z
    algo-1:18059:18059 [2] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
    2024-12-04T05:27:39.072Z
    algo-1:18059:18059 [2] NCCL INFO Bootstrap : Using eth0:10.0.206.83<0>
    2024-12-04T05:27:39.072Z
    algo-1:18060:18060 [3] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v8 symbol.
    2024-12-04T05:27:39.072Z
    algo-1:18060:18060 [3] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
    2024-12-04T05:27:39.072Z
    algo-1:18059:18059 [2] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v8 symbol.
    2024-12-04T05:27:39.072Z
    algo-1:18059:18059 [2] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
    2024-12-04T05:27:39.072Z
    algo-1:18061:18061 [4] NCCL INFO cudaDriverVersion 12040
    2024-12-04T05:27:39.072Z
    algo-1:18061:18061 [4] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
    2024-12-04T05:27:39.072Z
    algo-1:18061:18061 [4] NCCL INFO Bootstrap : Using eth0:10.0.206.83<0>
    2024-12-04T05:27:39.072Z
    algo-1:18061:18061 [4] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v8 symbol.
    2024-12-04T05:27:39.072Z
    algo-1:18061:18061 [4] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
    2024-12-04T05:27:39.072Z
    algo-1:18057:18130 [0] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.11.0-aws
    2024-12-04T05:27:39.072Z
    algo-1:18057:18130 [0] NCCL INFO NET/OFI Using Libfabric version 1.22
    2024-12-04T05:27:39.072Z
    algo-1:18057:18130 [0] NCCL INFO NET/OFI Using CUDA driver version 12040
    2024-12-04T05:27:39.073Z
    algo-1:18057:18130 [0] NCCL INFO NET/OFI Configuring AWS-specific options
    2024-12-04T05:27:39.073Z
    algo-1:18057:18130 [0] NCCL INFO NET/OFI Setting provider_filter to efa
    2024-12-04T05:27:39.073Z
    algo-1:18057:18130 [0] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
    2024-12-04T05:27:39.073Z
    algo-1:18057:18130 [0] NCCL INFO NET/OFI Setting NCCL_NVLSTREE_MAX_CHUNKSIZE to 512KiB
    2024-12-04T05:27:39.073Z
    algo-1:18057:18130 [0] NCCL INFO NET/OFI Setting NCCL_NVLS_CHUNKSIZE to 512KiB
    2024-12-04T05:27:39.073Z
    algo-1:18057:18130 [0] NCCL INFO NET/OFI Running on p4d.24xlarge platform, Setting NCCL_TOPO_FILE environment variable to /usr/local/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
    2024-12-04T05:27:39.073Z
    algo-1:18057:18130 [0] NCCL INFO NET/OFI Internode latency set at 75.0 us
    2024-12-04T05:27:39.073Z
    algo-1:18057:18130 [0] NCCL INFO NET/OFI Creating one domain per process
    2024-12-04T05:27:39.073Z
    algo-1:18057:18130 [0] NCCL INFO NET/OFI Using transport protocol SENDRECV
    2024-12-04T05:27:39.073Z
    algo-1:18062:18131 [5] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.11.0-aws
    2024-12-04T05:27:39.073Z
    algo-1:18062:18131 [5] NCCL INFO NET/OFI Using Libfabric version 1.22
    2024-12-04T05:27:39.073Z
    algo-1:18062:18131 [5] NCCL INFO NET/OFI Using CUDA driver version 12040
    2024-12-04T05:27:39.073Z
    algo-1:18062:18131 [5] NCCL INFO NET/OFI Configuring AWS-specific options
    2024-12-04T05:27:39.073Z
    algo-1:18062:18131 [5] NCCL INFO NET/OFI Setting provider_filter to efa
    2024-12-04T05:27:39.073Z
    algo-1:18062:18131 [5] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
    2024-12-04T05:27:39.073Z
    algo-1:18062:18131 [5] NCCL INFO NET/OFI Setting NCCL_NVLSTREE_MAX_CHUNKSIZE to 512KiB
    2024-12-04T05:27:39.073Z
    algo-1:18062:18131 [5] NCCL INFO NET/OFI Setting NCCL_NVLS_CHUNKSIZE to 512KiB
    2024-12-04T05:27:39.073Z
    algo-1:18062:18131 [5] NCCL INFO NET/OFI Running on p4d.24xlarge platform, Setting NCCL_TOPO_FILE environment variable to /usr/local/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
    2024-12-04T05:27:39.073Z
    algo-1:18062:18131 [5] NCCL INFO NET/OFI Internode latency set at 75.0 us
    2024-12-04T05:27:39.073Z
    algo-1:18062:18131 [5] NCCL INFO NET/OFI Creating one domain per process
    2024-12-04T05:27:39.073Z
    algo-1:18062:18131 [5] NCCL INFO NET/OFI Using transport protocol SENDRECV
    2024-12-04T05:27:39.073Z
    algo-1:18058:18132 [1] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.11.0-aws
    2024-12-04T05:27:39.073Z
    algo-1:18058:18132 [1] NCCL INFO NET/OFI Using Libfabric version 1.22
    2024-12-04T05:27:39.073Z
    algo-1:18058:18132 [1] NCCL INFO NET/OFI Using CUDA driver version 12040
    2024-12-04T05:27:39.073Z
    algo-1:18058:18132 [1] NCCL INFO NET/OFI Configuring AWS-specific options
    2024-12-04T05:27:39.073Z
    algo-1:18058:18132 [1] NCCL INFO NET/OFI Setting provider_filter to efa
    2024-12-04T05:27:39.073Z
    algo-1:18058:18132 [1] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
    2024-12-04T05:27:39.073Z
    algo-1:18058:18132 [1] NCCL INFO NET/OFI Setting NCCL_NVLSTREE_MAX_CHUNKSIZE to 512KiB
    2024-12-04T05:27:39.073Z
    algo-1:18058:18132 [1] NCCL INFO NET/OFI Setting NCCL_NVLS_CHUNKSIZE to 512KiB
    2024-12-04T05:27:39.073Z
    algo-1:18058:18132 [1] NCCL INFO NET/OFI Running on p4d.24xlarge platform, Setting NCCL_TOPO_FILE environment variable to /usr/local/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
    2024-12-04T05:27:39.073Z
    algo-1:18058:18132 [1] NCCL INFO NET/OFI Internode latency set at 75.0 us
    2024-12-04T05:27:39.073Z
    algo-1:18058:18132 [1] NCCL INFO NET/OFI Creating one domain per process
    2024-12-04T05:27:39.073Z
    algo-1:18058:18132 [1] NCCL INFO NET/OFI Using transport protocol SENDRECV
    2024-12-04T05:27:39.073Z
    algo-1:18063:18133 [6] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.11.0-aws
    2024-12-04T05:27:39.073Z
    algo-1:18063:18133 [6] NCCL INFO NET/OFI Using Libfabric version 1.22
    2024-12-04T05:27:39.073Z
    algo-1:18063:18133 [6] NCCL INFO NET/OFI Using CUDA driver version 12040
    2024-12-04T05:27:39.073Z
    algo-1:18063:18133 [6] NCCL INFO NET/OFI Configuring AWS-specific options
    2024-12-04T05:27:39.073Z
    algo-1:18063:18133 [6] NCCL INFO NET/OFI Setting provider_filter to efa
    2024-12-04T05:27:39.073Z
    algo-1:18063:18133 [6] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
    2024-12-04T05:27:39.073Z
    algo-1:18063:18133 [6] NCCL INFO NET/OFI Setting NCCL_NVLSTREE_MAX_CHUNKSIZE to 512KiB
    2024-12-04T05:27:39.073Z
    algo-1:18063:18133 [6] NCCL INFO NET/OFI Setting NCCL_NVLS_CHUNKSIZE to 512KiB
    2024-12-04T05:27:39.073Z
    algo-1:18063:18133 [6] NCCL INFO NET/OFI Running on p4d.24xlarge platform, Setting NCCL_TOPO_FILE environment variable to /usr/local/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
    2024-12-04T05:27:39.073Z
    algo-1:18063:18133 [6] NCCL INFO NET/OFI Internode latency set at 75.0 us
    2024-12-04T05:27:39.073Z
    algo-1:18063:18133 [6] NCCL INFO NET/OFI Creating one domain per process
    2024-12-04T05:27:39.073Z
    algo-1:18063:18133 [6] NCCL INFO NET/OFI Using transport protocol SENDRECV
    2024-12-04T05:27:39.073Z
    algo-1:18064:18134 [7] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.11.0-aws
    2024-12-04T05:27:39.073Z
    algo-1:18064:18134 [7] NCCL INFO NET/OFI Using Libfabric version 1.22
    2024-12-04T05:27:39.073Z
    algo-1:18064:18134 [7] NCCL INFO NET/OFI Using CUDA driver version 12040
    2024-12-04T05:27:39.073Z
    algo-1:18064:18134 [7] NCCL INFO NET/OFI Configuring AWS-specific options
    2024-12-04T05:27:39.073Z
    algo-1:18064:18134 [7] NCCL INFO NET/OFI Setting provider_filter to efa
    2024-12-04T05:27:39.073Z
    algo-1:18064:18134 [7] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
    2024-12-04T05:27:39.073Z
    algo-1:18064:18134 [7] NCCL INFO NET/OFI Setting NCCL_NVLSTREE_MAX_CHUNKSIZE to 512KiB
    2024-12-04T05:27:39.073Z
    algo-1:18064:18134 [7] NCCL INFO NET/OFI Setting NCCL_NVLS_CHUNKSIZE to 512KiB
    2024-12-04T05:27:39.073Z
    algo-1:18064:18134 [7] NCCL INFO NET/OFI Running on p4d.24xlarge platform, Setting NCCL_TOPO_FILE environment variable to /usr/local/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
    2024-12-04T05:27:39.073Z
    algo-1:18064:18134 [7] NCCL INFO NET/OFI Internode latency set at 75.0 us
    2024-12-04T05:27:39.073Z
    algo-1:18064:18134 [7] NCCL INFO NET/OFI Creating one domain per process
    2024-12-04T05:27:39.073Z
    algo-1:18064:18134 [7] NCCL INFO NET/OFI Using transport protocol SENDRECV
    2024-12-04T05:27:39.073Z
    algo-1:18060:18135 [3] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.11.0-aws
    2024-12-04T05:27:39.073Z
    algo-1:18060:18135 [3] NCCL INFO NET/OFI Using Libfabric version 1.22
    2024-12-04T05:27:39.073Z
    algo-1:18060:18135 [3] NCCL INFO NET/OFI Using CUDA driver version 12040
    2024-12-04T05:27:39.074Z
    algo-1:18060:18135 [3] NCCL INFO NET/OFI Configuring AWS-specific options
    2024-12-04T05:27:39.074Z
    algo-1:18060:18135 [3] NCCL INFO NET/OFI Setting provider_filter to efa
    2024-12-04T05:27:39.074Z
    algo-1:18060:18135 [3] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
    2024-12-04T05:27:39.074Z
    algo-1:18060:18135 [3] NCCL INFO NET/OFI Setting NCCL_NVLSTREE_MAX_CHUNKSIZE to 512KiB
    2024-12-04T05:27:39.074Z
    algo-1:18060:18135 [3] NCCL INFO NET/OFI Setting NCCL_NVLS_CHUNKSIZE to 512KiB
    2024-12-04T05:27:39.074Z
    algo-1:18060:18135 [3] NCCL INFO NET/OFI Running on p4d.24xlarge platform, Setting NCCL_TOPO_FILE environment variable to /usr/local/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
    2024-12-04T05:27:39.074Z
    algo-1:18060:18135 [3] NCCL INFO NET/OFI Internode latency set at 75.0 us
    2024-12-04T05:27:39.074Z
    algo-1:18060:18135 [3] NCCL INFO NET/OFI Creating one domain per process
    2024-12-04T05:27:39.074Z
    algo-1:18060:18135 [3] NCCL INFO NET/OFI Using transport protocol SENDRECV
    2024-12-04T05:27:39.074Z
    algo-1:18059:18136 [2] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.11.0-aws
    2024-12-04T05:27:39.074Z
    algo-1:18059:18136 [2] NCCL INFO NET/OFI Using Libfabric version 1.22
    2024-12-04T05:27:39.074Z
    algo-1:18059:18136 [2] NCCL INFO NET/OFI Using CUDA driver version 12040
    2024-12-04T05:27:39.074Z
    algo-1:18059:18136 [2] NCCL INFO NET/OFI Configuring AWS-specific options
    2024-12-04T05:27:39.074Z
    algo-1:18059:18136 [2] NCCL INFO NET/OFI Setting provider_filter to efa
    2024-12-04T05:27:39.074Z
    algo-1:18059:18136 [2] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
    2024-12-04T05:27:39.074Z
    algo-1:18059:18136 [2] NCCL INFO NET/OFI Setting NCCL_NVLSTREE_MAX_CHUNKSIZE to 512KiB
    2024-12-04T05:27:39.074Z
    algo-1:18059:18136 [2] NCCL INFO NET/OFI Setting NCCL_NVLS_CHUNKSIZE to 512KiB
    2024-12-04T05:27:39.074Z
    algo-1:18059:18136 [2] NCCL INFO NET/OFI Running on p4d.24xlarge platform, Setting NCCL_TOPO_FILE environment variable to /usr/local/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
    2024-12-04T05:27:39.074Z
    algo-1:18059:18136 [2] NCCL INFO NET/OFI Internode latency set at 75.0 us
    2024-12-04T05:27:39.074Z
    algo-1:18059:18136 [2] NCCL INFO NET/OFI Creating one domain per process
    2024-12-04T05:27:39.074Z
    algo-1:18059:18136 [2] NCCL INFO NET/OFI Using transport protocol SENDRECV
    2024-12-04T05:27:39.074Z
    algo-1:18057:18130 [0] NCCL INFO NET/OFI Selected Provider is efa (found 4 nics)
    2024-12-04T05:27:39.074Z
    algo-1:18062:18131 [5] NCCL INFO NET/OFI Selected Provider is efa (found 4 nics)
    2024-12-04T05:27:39.074Z
    algo-1:18058:18132 [1] NCCL INFO NET/OFI Selected Provider is efa (found 4 nics)
    2024-12-04T05:27:39.074Z
    algo-1:18061:18137 [4] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.11.0-aws
    2024-12-04T05:27:39.074Z
    algo-1:18061:18137 [4] NCCL INFO NET/OFI Using Libfabric version 1.22
    2024-12-04T05:27:39.074Z
    algo-1:18061:18137 [4] NCCL INFO NET/OFI Using CUDA driver version 12040
    2024-12-04T05:27:39.074Z
    algo-1:18061:18137 [4] NCCL INFO NET/OFI Configuring AWS-specific options
    2024-12-04T05:27:39.074Z
    algo-1:18061:18137 [4] NCCL INFO NET/OFI Setting provider_filter to efa
    2024-12-04T05:27:39.074Z
    algo-1:18061:18137 [4] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
    2024-12-04T05:27:39.074Z
    algo-1:18061:18137 [4] NCCL INFO NET/OFI Setting NCCL_NVLSTREE_MAX_CHUNKSIZE to 512KiB
    2024-12-04T05:27:39.074Z
    algo-1:18061:18137 [4] NCCL INFO NET/OFI Setting NCCL_NVLS_CHUNKSIZE to 512KiB
    2024-12-04T05:27:39.074Z
    algo-1:18061:18137 [4] NCCL INFO NET/OFI Running on p4d.24xlarge platform, Setting NCCL_TOPO_FILE environment variable to /usr/local/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
    2024-12-04T05:27:39.074Z
    algo-1:18061:18137 [4] NCCL INFO NET/OFI Internode latency set at 75.0 us
    2024-12-04T05:27:39.074Z
    algo-1:18061:18137 [4] NCCL INFO NET/OFI Creating one domain per process
    2024-12-04T05:27:39.074Z
    algo-1:18061:18137 [4] NCCL INFO NET/OFI Using transport protocol SENDRECV
    2024-12-04T05:27:39.074Z
    libfabric:18057:1733290059::efa:ep_ctrl:efa_rdm_ep_set_shared_memory_permitted():1410 FI_OPT_SHARED_MEMORY_PERMITTED set to false
    2024-12-04T05:27:39.074Z
    libfabric:18058:1733290059::efa:ep_ctrl:efa_rdm_ep_set_shared_memory_permitted():1410 FI_OPT_SHARED_MEMORY_PERMITTED set to false
    2024-12-04T05:27:39.074Z
    algo-1:18057:18130 [0] NCCL INFO NET/OFI Setting FI_OPT_EFA_SENDRECV_IN_ORDER_ALIGNED_128_BYTES not supported.
    2024-12-04T05:27:39.074Z
    algo-1:18057:18130 [0] NCCL INFO NET/OFI Setting NCCL_PROTO to "simple"
    2024-12-04T05:27:39.074Z
    algo-1:18058:18132 [1] NCCL INFO NET/OFI Setting FI_OPT_EFA_SENDRECV_IN_ORDER_ALIGNED_128_BYTES not supported.
    2024-12-04T05:27:39.074Z
    algo-1:18058:18132 [1] NCCL INFO NET/OFI Setting NCCL_PROTO to "simple"
    2024-12-04T05:27:39.074Z
    libfabric:18062:1733290059::efa:ep_ctrl:efa_rdm_ep_set_shared_memory_permitted():1410 FI_OPT_SHARED_MEMORY_PERMITTED set to false
    2024-12-04T05:27:39.074Z
    algo-1:18057:18130 [0] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:39.074Z
    algo-1:18057:18130 [0] NCCL INFO Using non-device net plugin version 0
    2024-12-04T05:27:39.074Z
    algo-1:18057:18130 [0] NCCL INFO Using network AWS Libfabric
    2024-12-04T05:27:39.074Z
    algo-1:18058:18132 [1] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:39.074Z
    algo-1:18062:18131 [5] NCCL INFO NET/OFI Setting FI_OPT_EFA_SENDRECV_IN_ORDER_ALIGNED_128_BYTES not supported.
    2024-12-04T05:27:39.074Z
    algo-1:18062:18131 [5] NCCL INFO NET/OFI Setting NCCL_PROTO to "simple"
    2024-12-04T05:27:39.074Z
    algo-1:18058:18132 [1] NCCL INFO Using non-device net plugin version 0
    2024-12-04T05:27:39.074Z
    algo-1:18058:18132 [1] NCCL INFO Using network AWS Libfabric
    2024-12-04T05:27:39.074Z
    algo-1:18062:18131 [5] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:39.074Z
    algo-1:18062:18131 [5] NCCL INFO Using non-device net plugin version 0
    2024-12-04T05:27:39.074Z
    algo-1:18062:18131 [5] NCCL INFO Using network AWS Libfabric
    2024-12-04T05:27:40.075Z
    algo-1:18063:18133 [6] NCCL INFO NET/OFI Selected Provider is efa (found 4 nics)
    2024-12-04T05:27:40.075Z
    algo-1:18064:18134 [7] NCCL INFO NET/OFI Selected Provider is efa (found 4 nics)
    2024-12-04T05:27:40.075Z
    algo-1:18060:18135 [3] NCCL INFO NET/OFI Selected Provider is efa (found 4 nics)
    2024-12-04T05:27:40.075Z
    algo-1:18059:18136 [2] NCCL INFO NET/OFI Selected Provider is efa (found 4 nics)
    2024-12-04T05:27:40.075Z
    algo-1:18061:18137 [4] NCCL INFO NET/OFI Selected Provider is efa (found 4 nics)
    2024-12-04T05:27:40.075Z
    libfabric:18064:1733290059::efa:ep_ctrl:efa_rdm_ep_set_shared_memory_permitted():1410 FI_OPT_SHARED_MEMORY_PERMITTED set to false
    2024-12-04T05:27:40.075Z
    libfabric:18063:1733290059::efa:ep_ctrl:efa_rdm_ep_set_shared_memory_permitted():1410 FI_OPT_SHARED_MEMORY_PERMITTED set to false
    2024-12-04T05:27:40.075Z
    algo-1:18064:18134 [7] NCCL INFO NET/OFI Setting FI_OPT_EFA_SENDRECV_IN_ORDER_ALIGNED_128_BYTES not supported.
    2024-12-04T05:27:40.075Z
    algo-1:18064:18134 [7] NCCL INFO NET/OFI Setting NCCL_PROTO to "simple"
    2024-12-04T05:27:40.075Z
    algo-1:18063:18133 [6] NCCL INFO NET/OFI Setting FI_OPT_EFA_SENDRECV_IN_ORDER_ALIGNED_128_BYTES not supported.
    2024-12-04T05:27:40.075Z
    algo-1:18063:18133 [6] NCCL INFO NET/OFI Setting NCCL_PROTO to "simple"
    2024-12-04T05:27:40.075Z
    algo-1:18064:18134 [7] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:40.075Z
    algo-1:18064:18134 [7] NCCL INFO Using non-device net plugin version 0
    2024-12-04T05:27:40.075Z
    algo-1:18064:18134 [7] NCCL INFO Using network AWS Libfabric
    2024-12-04T05:27:40.075Z
    algo-1:18063:18133 [6] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:40.075Z
    algo-1:18063:18133 [6] NCCL INFO Using non-device net plugin version 0
    2024-12-04T05:27:40.075Z
    algo-1:18063:18133 [6] NCCL INFO Using network AWS Libfabric
    2024-12-04T05:27:40.075Z
    libfabric:18060:1733290059::efa:ep_ctrl:efa_rdm_ep_set_shared_memory_permitted():1410 FI_OPT_SHARED_MEMORY_PERMITTED set to false
    2024-12-04T05:27:40.075Z
    algo-1:18060:18135 [3] NCCL INFO NET/OFI Setting FI_OPT_EFA_SENDRECV_IN_ORDER_ALIGNED_128_BYTES not supported.
    2024-12-04T05:27:40.075Z
    algo-1:18060:18135 [3] NCCL INFO NET/OFI Setting NCCL_PROTO to "simple"
    2024-12-04T05:27:40.075Z
    algo-1:18060:18135 [3] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:40.075Z
    algo-1:18060:18135 [3] NCCL INFO Using non-device net plugin version 0
    2024-12-04T05:27:40.075Z
    algo-1:18060:18135 [3] NCCL INFO Using network AWS Libfabric
    2024-12-04T05:27:40.075Z
    libfabric:18059:1733290059::efa:ep_ctrl:efa_rdm_ep_set_shared_memory_permitted():1410 FI_OPT_SHARED_MEMORY_PERMITTED set to false
    2024-12-04T05:27:40.075Z
    algo-1:18059:18136 [2] NCCL INFO NET/OFI Setting FI_OPT_EFA_SENDRECV_IN_ORDER_ALIGNED_128_BYTES not supported.
    2024-12-04T05:27:40.075Z
    algo-1:18059:18136 [2] NCCL INFO NET/OFI Setting NCCL_PROTO to "simple"
    2024-12-04T05:27:40.075Z
    algo-1:18059:18136 [2] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:40.075Z
    algo-1:18059:18136 [2] NCCL INFO Using non-device net plugin version 0
    2024-12-04T05:27:40.075Z
    algo-1:18059:18136 [2] NCCL INFO Using network AWS Libfabric
    2024-12-04T05:27:40.075Z
    libfabric:18061:1733290059::efa:ep_ctrl:efa_rdm_ep_set_shared_memory_permitted():1410 FI_OPT_SHARED_MEMORY_PERMITTED set to false
    2024-12-04T05:27:40.075Z
    algo-1:18061:18137 [4] NCCL INFO NET/OFI Setting FI_OPT_EFA_SENDRECV_IN_ORDER_ALIGNED_128_BYTES not supported.
    2024-12-04T05:27:40.075Z
    algo-1:18061:18137 [4] NCCL INFO NET/OFI Setting NCCL_PROTO to "simple"
    2024-12-04T05:27:40.075Z
    algo-1:18061:18137 [4] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:40.075Z
    algo-1:18061:18137 [4] NCCL INFO Using non-device net plugin version 0
    2024-12-04T05:27:40.075Z
    algo-1:18061:18137 [4] NCCL INFO Using network AWS Libfabric
    2024-12-04T05:27:40.075Z
    algo-1:18058:18132 [1] NCCL INFO DMA-BUF is available on GPU device 1
    2024-12-04T05:27:40.075Z
    algo-1:18062:18131 [5] NCCL INFO DMA-BUF is available on GPU device 5
    2024-12-04T05:27:40.075Z
    algo-1:18057:18130 [0] NCCL INFO DMA-BUF is available on GPU device 0
    2024-12-04T05:27:40.075Z
    algo-1:18064:18134 [7] NCCL INFO DMA-BUF is available on GPU device 7
    2024-12-04T05:27:40.075Z
    algo-1:18063:18133 [6] NCCL INFO DMA-BUF is available on GPU device 6
    2024-12-04T05:27:40.075Z
    algo-1:18060:18135 [3] NCCL INFO DMA-BUF is available on GPU device 3
    2024-12-04T05:27:40.075Z
    algo-1:18059:18136 [2] NCCL INFO DMA-BUF is available on GPU device 2
    2024-12-04T05:27:40.075Z
    algo-1:18061:18137 [4] NCCL INFO DMA-BUF is available on GPU device 4
    2024-12-04T05:27:42.076Z
    algo-1:18064:18134 [7] NCCL INFO ncclCommInitRank comm 0x564a47b7b3c0 rank 7 nranks 16 cudaDev 7 nvmlDev 7 busId a01d0 commId 0x9c65e97761265de2 - Init START
    2024-12-04T05:27:42.076Z
    algo-1:18061:18137 [4] NCCL INFO ncclCommInitRank comm 0x562b5cc5cba0 rank 4 nranks 16 cudaDev 4 nvmlDev 4 busId 901c0 commId 0x9c65e97761265de2 - Init START
    2024-12-04T05:27:42.076Z
    algo-1:18063:18133 [6] NCCL INFO ncclCommInitRank comm 0x55c93517fce0 rank 6 nranks 16 cudaDev 6 nvmlDev 6 busId a01c0 commId 0x9c65e97761265de2 - Init START
    2024-12-04T05:27:42.076Z
    algo-1:18062:18131 [5] NCCL INFO ncclCommInitRank comm 0x56220f3a29f0 rank 5 nranks 16 cudaDev 5 nvmlDev 5 busId 901d0 commId 0x9c65e97761265de2 - Init START
    2024-12-04T05:27:42.076Z
    algo-1:18060:18135 [3] NCCL INFO ncclCommInitRank comm 0x5640eee77090 rank 3 nranks 16 cudaDev 3 nvmlDev 3 busId 201d0 commId 0x9c65e97761265de2 - Init START
    2024-12-04T05:27:42.076Z
    algo-1:18058:18132 [1] NCCL INFO ncclCommInitRank comm 0x564d4e132270 rank 1 nranks 16 cudaDev 1 nvmlDev 1 busId 101d0 commId 0x9c65e97761265de2 - Init START
    2024-12-04T05:27:42.076Z
    algo-1:18059:18136 [2] NCCL INFO ncclCommInitRank comm 0x55e5bd726dc0 rank 2 nranks 16 cudaDev 2 nvmlDev 2 busId 201c0 commId 0x9c65e97761265de2 - Init START
    2024-12-04T05:27:42.076Z
    algo-1:18057:18130 [0] NCCL INFO ncclCommInitRank comm 0x56129032bf60 rank 0 nranks 16 cudaDev 0 nvmlDev 0 busId 101c0 commId 0x9c65e97761265de2 - Init START
    2024-12-04T05:27:42.076Z
    algo-1:18063:18133 [6] NCCL INFO NCCL_TOPO_FILE set by environment to /usr/local/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
    2024-12-04T05:27:42.076Z
    algo-1:18062:18131 [5] NCCL INFO NCCL_TOPO_FILE set by environment to /usr/local/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
    2024-12-04T05:27:42.076Z
    algo-1:18059:18136 [2] NCCL INFO NCCL_TOPO_FILE set by environment to /usr/local/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
    2024-12-04T05:27:42.076Z
    algo-1:18057:18130 [0] NCCL INFO NCCL_TOPO_FILE set by environment to /usr/local/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
    2024-12-04T05:27:42.076Z
    algo-1:18064:18134 [7] NCCL INFO NCCL_TOPO_FILE set by environment to /usr/local/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
    2024-12-04T05:27:42.076Z
    algo-1:18061:18137 [4] NCCL INFO NCCL_TOPO_FILE set by environment to /usr/local/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
    2024-12-04T05:27:42.076Z
    algo-1:18060:18135 [3] NCCL INFO NCCL_TOPO_FILE set by environment to /usr/local/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
    2024-12-04T05:27:42.076Z
    algo-1:18058:18132 [1] NCCL INFO NCCL_TOPO_FILE set by environment to /usr/local/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
    2024-12-04T05:27:44.077Z
    algo-1:18057:18130 [0] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:44.077Z
    algo-1:18057:18130 [0] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:44.077Z
    algo-1:18057:18130 [0] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:44.077Z
    algo-1:18057:18130 [0] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:44.077Z
    algo-1:18062:18131 [5] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:44.077Z
    algo-1:18062:18131 [5] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:44.077Z
    algo-1:18062:18131 [5] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:44.077Z
    algo-1:18062:18131 [5] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:44.077Z
    algo-1:18063:18133 [6] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:44.077Z
    algo-1:18063:18133 [6] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:44.077Z
    algo-1:18063:18133 [6] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:44.077Z
    algo-1:18063:18133 [6] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:44.077Z
    algo-1:18059:18136 [2] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:44.077Z
    algo-1:18059:18136 [2] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:44.077Z
    algo-1:18059:18136 [2] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:44.077Z
    algo-1:18059:18136 [2] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:44.077Z
    algo-1:18064:18134 [7] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:44.077Z
    algo-1:18064:18134 [7] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:44.077Z
    algo-1:18064:18134 [7] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:44.077Z
    algo-1:18064:18134 [7] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:44.077Z
    algo-1:18060:18135 [3] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:44.077Z
    algo-1:18060:18135 [3] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:44.077Z
    algo-1:18060:18135 [3] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:44.077Z
    algo-1:18060:18135 [3] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:44.077Z
    algo-1:18058:18132 [1] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:44.077Z
    algo-1:18058:18132 [1] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:44.077Z
    algo-1:18058:18132 [1] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:44.077Z
    algo-1:18058:18132 [1] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:44.077Z
    algo-1:18057:18130 [0] NCCL INFO Setting affinity for GPU 0 to ff,ffff0000,00ffffff
    2024-12-04T05:27:44.077Z
    algo-1:18057:18130 [0] NCCL INFO NVLS multicast support is not available on dev 0
    2024-12-04T05:27:44.077Z
    algo-1:18061:18137 [4] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:44.077Z
    algo-1:18061:18137 [4] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:44.077Z
    algo-1:18061:18137 [4] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:44.077Z
    algo-1:18061:18137 [4] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:44.078Z
    algo-1:18062:18131 [5] NCCL INFO Setting affinity for GPU 5 to ffffff00,0000ffff,ff000000
    2024-12-04T05:27:44.078Z
    algo-1:18062:18131 [5] NCCL INFO NVLS multicast support is not available on dev 5
    2024-12-04T05:27:44.078Z
    algo-1:18063:18133 [6] NCCL INFO Setting affinity for GPU 6 to ffffff00,0000ffff,ff000000
    2024-12-04T05:27:44.078Z
    algo-1:18063:18133 [6] NCCL INFO NVLS multicast support is not available on dev 6
    2024-12-04T05:27:44.078Z
    algo-1:18059:18136 [2] NCCL INFO Setting affinity for GPU 2 to ff,ffff0000,00ffffff
    2024-12-04T05:27:44.078Z
    algo-1:18064:18134 [7] NCCL INFO Setting affinity for GPU 7 to ffffff00,0000ffff,ff000000
    2024-12-04T05:27:44.078Z
    algo-1:18064:18134 [7] NCCL INFO NVLS multicast support is not available on dev 7
    2024-12-04T05:27:44.078Z
    algo-1:18059:18136 [2] NCCL INFO NVLS multicast support is not available on dev 2
    2024-12-04T05:27:44.078Z
    algo-1:18060:18135 [3] NCCL INFO Setting affinity for GPU 3 to ff,ffff0000,00ffffff
    2024-12-04T05:27:44.078Z
    algo-1:18060:18135 [3] NCCL INFO NVLS multicast support is not available on dev 3
    2024-12-04T05:27:44.078Z
    algo-1:18058:18132 [1] NCCL INFO Setting affinity for GPU 1 to ff,ffff0000,00ffffff
    2024-12-04T05:27:44.078Z
    algo-1:18058:18132 [1] NCCL INFO NVLS multicast support is not available on dev 1
    2024-12-04T05:27:44.078Z
    algo-1:18061:18137 [4] NCCL INFO Setting affinity for GPU 4 to ffffff00,0000ffff,ff000000
    2024-12-04T05:27:44.078Z
    algo-1:18061:18137 [4] NCCL INFO NVLS multicast support is not available on dev 4
    2024-12-04T05:27:44.078Z
    algo-1:18064:18134 [7] NCCL INFO comm 0x564a47b7b3c0 rank 7 nRanks 16 nNodes 2 localRanks 8 localRank 7 MNNVL 0
    2024-12-04T05:27:44.078Z
    algo-1:18063:18133 [6] NCCL INFO comm 0x55c93517fce0 rank 6 nRanks 16 nNodes 2 localRanks 8 localRank 6 MNNVL 0
    2024-12-04T05:27:44.078Z
    algo-1:18064:18134 [7] NCCL INFO Trees [0] -1/-1/-1->7->6 [1] 0/-1/-1->7->6 [2] 0/-1/-1->7->6 [3] 0/-1/-1->7->6 [4] -1/-1/-1->7->6 [5] 0/-1/-1->7->6 [6] 0/-1/-1->7->6 [7] 0/-1/-1->7->6
    2024-12-04T05:27:44.078Z
    algo-1:18062:18131 [5] NCCL INFO comm 0x56220f3a29f0 rank 5 nRanks 16 nNodes 2 localRanks 8 localRank 5 MNNVL 0
    2024-12-04T05:27:44.078Z
    algo-1:18061:18137 [4] NCCL INFO comm 0x562b5cc5cba0 rank 4 nRanks 16 nNodes 2 localRanks 8 localRank 4 MNNVL 0
    2024-12-04T05:27:44.078Z
    algo-1:18063:18133 [6] NCCL INFO Trees [0] 7/-1/-1->6->5 [1] 7/-1/-1->6->5 [2] 7/-1/-1->6->5 [3] 7/14/-1->6->-1 [4] 7/-1/-1->6->5 [5] 7/-1/-1->6->5 [6] 7/-1/-1->6->5 [7] 7/-1/-1->6->14
    2024-12-04T05:27:44.078Z
    algo-1:18064:18134 [7] NCCL INFO P2P Chunksize set to 131072
    2024-12-04T05:27:44.078Z
    algo-1:18063:18133 [6] NCCL INFO P2P Chunksize set to 131072
    2024-12-04T05:27:44.078Z
    algo-1:18062:18131 [5] NCCL INFO Trees [0] 6/-1/-1->5->4 [1] 6/-1/-1->5->4 [2] 6/-1/-1->5->4 [3] -1/-1/-1->5->4 [4] 6/-1/-1->5->4 [5] 6/-1/-1->5->4 [6] 6/-1/-1->5->4 [7] -1/-1/-1->5->4
    2024-12-04T05:27:44.078Z
    algo-1:18061:18137 [4] NCCL INFO Trees [0] 5/-1/-1->4->3 [1] 5/-1/-1->4->3 [2] 5/12/-1->4->-1 [3] 5/-1/-1->4->3 [4] 5/-1/-1->4->3 [5] 5/-1/-1->4->3 [6] 5/-1/-1->4->12 [7] 5/-1/-1->4->3
    2024-12-04T05:27:44.078Z
    algo-1:18062:18131 [5] NCCL INFO P2P Chunksize set to 131072
    2024-12-04T05:27:44.078Z
    algo-1:18061:18137 [4] NCCL INFO P2P Chunksize set to 131072
    2024-12-04T05:27:44.078Z
    algo-1:18060:18135 [3] NCCL INFO comm 0x5640eee77090 rank 3 nRanks 16 nNodes 2 localRanks 8 localRank 3 MNNVL 0
    2024-12-04T05:27:44.078Z
    algo-1:18058:18132 [1] NCCL INFO comm 0x564d4e132270 rank 1 nRanks 16 nNodes 2 localRanks 8 localRank 1 MNNVL 0
    2024-12-04T05:27:44.078Z
    algo-1:18059:18136 [2] NCCL INFO comm 0x55e5bd726dc0 rank 2 nRanks 16 nNodes 2 localRanks 8 localRank 2 MNNVL 0
    2024-12-04T05:27:44.078Z
    algo-1:18060:18135 [3] NCCL INFO Trees [0] 4/-1/-1->3->2 [1] 4/-1/-1->3->2 [2] -1/-1/-1->3->2 [3] 4/-1/-1->3->2 [4] 4/-1/-1->3->2 [5] 4/-1/-1->3->2 [6] -1/-1/-1->3->2 [7] 4/-1/-1->3->2
    2024-12-04T05:27:44.078Z
    algo-1:18058:18132 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] -1/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] 2/-1/-1->1->0 [4] 2/-1/-1->1->0 [5] -1/-1/-1->1->0 [6] 2/-1/-1->1->0 [7] 2/-1/-1->1->0
    2024-12-04T05:27:44.078Z
    algo-1:18060:18135 [3] NCCL INFO P2P Chunksize set to 131072
    2024-12-04T05:27:44.078Z
    algo-1:18058:18132 [1] NCCL INFO P2P Chunksize set to 131072
    2024-12-04T05:27:44.078Z
    algo-1:18059:18136 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/10/-1->2->-1 [2] 3/-1/-1->2->1 [3] 3/-1/-1->2->1 [4] 3/-1/-1->2->1 [5] 3/-1/-1->2->10 [6] 3/-1/-1->2->1 [7] 3/-1/-1->2->1
    2024-12-04T05:27:44.078Z
    algo-1:18059:18136 [2] NCCL INFO P2P Chunksize set to 131072
    2024-12-04T05:27:44.078Z
    algo-1:18057:18130 [0] NCCL INFO comm 0x56129032bf60 rank 0 nRanks 16 nNodes 2 localRanks 8 localRank 0 MNNVL 0
    2024-12-04T05:27:44.078Z
    algo-1:18057:18130 [0] NCCL INFO Channel 00/08 : 0 7 6 5 4 3 2 1 8 15 14 13 12 11 10 9
    2024-12-04T05:27:44.078Z
    algo-1:18057:18130 [0] NCCL INFO Channel 01/08 : 0 7 6 5 4 3 10 9 8 15 14 13 12 11 2 1
    2024-12-04T05:27:44.078Z
    algo-1:18057:18130 [0] NCCL INFO Channel 02/08 : 0 7 6 5 12 11 10 9 8 15 14 13 4 3 2 1
    2024-12-04T05:27:44.078Z
    algo-1:18057:18130 [0] NCCL INFO Channel 03/08 : 0 7 14 13 12 11 10 9 8 15 6 5 4 3 2 1
    2024-12-04T05:27:44.078Z
    algo-1:18057:18130 [0] NCCL INFO Channel 04/08 : 0 7 6 5 4 3 2 1 8 15 14 13 12 11 10 9
    2024-12-04T05:27:44.078Z
    algo-1:18057:18130 [0] NCCL INFO Channel 05/08 : 0 7 6 5 4 3 10 9 8 15 14 13 12 11 2 1
    2024-12-04T05:27:44.078Z
    algo-1:18057:18130 [0] NCCL INFO Channel 06/08 : 0 7 6 5 12 11 10 9 8 15 14 13 4 3 2 1
    2024-12-04T05:27:44.078Z
    algo-1:18057:18130 [0] NCCL INFO Channel 07/08 : 0 7 14 13 12 11 10 9 8 15 6 5 4 3 2 1
    2024-12-04T05:27:44.078Z
    algo-1:18057:18130 [0] NCCL INFO Trees [0] 1/8/-1->0->-1 [1] 1/-1/-1->0->7 [2] 1/-1/-1->0->7 [3] 1/-1/-1->0->7 [4] 1/-1/-1->0->8 [5] 1/-1/-1->0->7 [6] 1/-1/-1->0->7 [7] 1/-1/-1->0->7
    2024-12-04T05:27:44.078Z
    algo-1:18057:18130 [0] NCCL INFO P2P Chunksize set to 131072
    2024-12-04T05:27:44.078Z
    algo-1:18063:18138 [6] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:44.078Z
    algo-1:18059:18146 [2] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:44.078Z
    algo-1:18064:18139 [7] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:44.078Z
    algo-1:18064:18134 [7] NCCL INFO Channel 03/0 : 7[7] -> 14[6] [send] via NET/AWS Libfabric/3/GDRDMA
    2024-12-04T05:27:44.078Z
    algo-1:18058:18145 [1] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:44.078Z
    algo-1:18060:18148 [3] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:44.078Z
    algo-1:18058:18132 [1] NCCL INFO Channel 00/0 : 1[1] -> 8[0] [send] via NET/AWS Libfabric/0/GDRDMA
    2024-12-04T05:27:44.078Z
    algo-1:18060:18135 [3] NCCL INFO Channel 01/0 : 3[3] -> 10[2] [send] via NET/AWS Libfabric/1/GDRDMA
    2024-12-04T05:27:44.078Z
    algo-1:18064:18139 [7] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:44.078Z
    libfabric:18063:1733290063::efa:ep_ctrl:efa_rdm_ep_set_shared_memory_permitted():1410 FI_OPT_SHARED_MEMORY_PERMITTED set to false
    2024-12-04T05:27:44.078Z
    algo-1:18064:18134 [7] NCCL INFO Channel 07/0 : 7[7] -> 14[6] [send] via NET/AWS Libfabric/3/GDRDMA
    2024-12-04T05:27:44.078Z
    algo-1:18058:18145 [1] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:44.078Z
    algo-1:18060:18148 [3] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:44.079Z
    algo-1:18058:18132 [1] NCCL INFO Channel 04/0 : 1[1] -> 8[0] [send] via NET/AWS Libfabric/0/GDRDMA
    2024-12-04T05:27:44.079Z
    algo-1:18057:18152 [0] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:44.079Z
    algo-1:18060:18135 [3] NCCL INFO Channel 05/0 : 3[3] -> 10[2] [send] via NET/AWS Libfabric/1/GDRDMA
    2024-12-04T05:27:44.079Z
    algo-1:18061:18141 [4] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:44.079Z
    libfabric:18057:1733290063::efa:ep_ctrl:efa_rdm_ep_set_shared_memory_permitted():1410 FI_OPT_SHARED_MEMORY_PERMITTED set to false
    2024-12-04T05:27:44.079Z
    algo-1:18062:18140 [5] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:44.079Z
    algo-1:18062:18131 [5] NCCL INFO Channel 02/0 : 5[5] -> 12[4] [send] via NET/AWS Libfabric/2/GDRDMA
    2024-12-04T05:27:44.079Z
    algo-1:18062:18140 [5] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:44.079Z
    algo-1:18062:18131 [5] NCCL INFO Channel 06/0 : 5[5] -> 12[4] [send] via NET/AWS Libfabric/2/GDRDMA
    2024-12-04T05:27:44.079Z
    libfabric:18059:1733290063::efa:ep_ctrl:efa_rdm_ep_set_shared_memory_permitted():1410 FI_OPT_SHARED_MEMORY_PERMITTED set to false
    2024-12-04T05:27:44.079Z
    libfabric:18061:1733290063::efa:ep_ctrl:efa_rdm_ep_set_shared_memory_permitted():1410 FI_OPT_SHARED_MEMORY_PERMITTED set to false
    2024-12-04T05:27:44.079Z
    algo-1:18063:18133 [6] NCCL INFO Channel 03/0 : 15[7] -> 6[6] [receive] via NET/AWS Libfabric/3/GDRDMA
    2024-12-04T05:27:44.079Z
    algo-1:18063:18138 [6] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:44.079Z
    algo-1:18063:18133 [6] NCCL INFO Channel 07/0 : 15[7] -> 6[6] [receive] via NET/AWS Libfabric/3/GDRDMA
    2024-12-04T05:27:44.079Z
    algo-1:18063:18133 [6] NCCL INFO Channel 00/0 : 6[6] -> 5[5] via P2P/CUMEM/read
    2024-12-04T05:27:44.079Z
    algo-1:18063:18133 [6] NCCL INFO Channel 01/0 : 6[6] -> 5[5] via P2P/CUMEM/read
    2024-12-04T05:27:44.079Z
    algo-1:18061:18137 [4] NCCL INFO Channel 02/0 : 13[5] -> 4[4] [receive] via NET/AWS Libfabric/2/GDRDMA
    2024-12-04T05:27:44.079Z
    algo-1:18061:18141 [4] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:44.079Z
    algo-1:18063:18133 [6] NCCL INFO Channel 02/0 : 6[6] -> 5[5] via P2P/CUMEM/read
    2024-12-04T05:27:44.079Z
    algo-1:18061:18137 [4] NCCL INFO Channel 06/0 : 13[5] -> 4[4] [receive] via NET/AWS Libfabric/2/GDRDMA
    2024-12-04T05:27:44.079Z
    algo-1:18063:18133 [6] NCCL INFO Channel 03/0 : 6[6] -> 5[5] via P2P/CUMEM/read
    2024-12-04T05:27:44.079Z
    algo-1:18063:18133 [6] NCCL INFO Channel 04/0 : 6[6] -> 5[5] via P2P/CUMEM/read
    2024-12-04T05:27:44.079Z
    algo-1:18063:18133 [6] NCCL INFO Channel 05/0 : 6[6] -> 5[5] via P2P/CUMEM/read
    2024-12-04T05:27:44.079Z
    algo-1:18062:18131 [5] NCCL INFO Channel 00/0 : 5[5] -> 4[4] via P2P/CUMEM/read
    2024-12-04T05:27:44.079Z
    algo-1:18063:18133 [6] NCCL INFO Channel 06/0 : 6[6] -> 5[5] via P2P/CUMEM/read
    2024-12-04T05:27:44.079Z
    algo-1:18060:18135 [3] NCCL INFO Channel 00/0 : 3[3] -> 2[2] via P2P/CUMEM/read
    2024-12-04T05:27:44.079Z
    algo-1:18063:18133 [6] NCCL INFO Channel 07/0 : 6[6] -> 5[5] via P2P/CUMEM/read
    2024-12-04T05:27:44.079Z
    algo-1:18062:18131 [5] NCCL INFO Channel 01/0 : 5[5] -> 4[4] via P2P/CUMEM/read
    2024-12-04T05:27:44.079Z
    algo-1:18060:18135 [3] NCCL INFO Channel 02/0 : 3[3] -> 2[2] via P2P/CUMEM/read
    2024-12-04T05:27:44.079Z
    algo-1:18062:18131 [5] NCCL INFO Channel 03/0 : 5[5] -> 4[4] via P2P/CUMEM/read
    2024-12-04T05:27:44.079Z
    algo-1:18060:18135 [3] NCCL INFO Channel 03/0 : 3[3] -> 2[2] via P2P/CUMEM/read
    2024-12-04T05:27:44.079Z
    algo-1:18062:18131 [5] NCCL INFO Channel 04/0 : 5[5] -> 4[4] via P2P/CUMEM/read
    2024-12-04T05:27:44.079Z
    algo-1:18060:18135 [3] NCCL INFO Channel 04/0 : 3[3] -> 2[2] via P2P/CUMEM/read
    2024-12-04T05:27:44.079Z
    algo-1:18061:18137 [4] NCCL INFO Channel 00/0 : 4[4] -> 3[3] via P2P/CUMEM/read
    2024-12-04T05:27:44.079Z
    algo-1:18062:18131 [5] NCCL INFO Channel 05/0 : 5[5] -> 4[4] via P2P/CUMEM/read
    2024-12-04T05:27:44.079Z
    algo-1:18060:18135 [3] NCCL INFO Channel 06/0 : 3[3] -> 2[2] via P2P/CUMEM/read
    2024-12-04T05:27:44.079Z
    algo-1:18061:18137 [4] NCCL INFO Channel 01/0 : 4[4] -> 3[3] via P2P/CUMEM/read
    2024-12-04T05:27:44.079Z
    algo-1:18062:18131 [5] NCCL INFO Channel 07/0 : 5[5] -> 4[4] via P2P/CUMEM/read
    2024-12-04T05:27:44.079Z
    algo-1:18060:18135 [3] NCCL INFO Channel 07/0 : 3[3] -> 2[2] via P2P/CUMEM/read
    2024-12-04T05:27:44.079Z
    algo-1:18061:18137 [4] NCCL INFO Channel 02/0 : 4[4] -> 3[3] via P2P/CUMEM/read
    2024-12-04T05:27:44.079Z
    algo-1:18061:18137 [4] NCCL INFO Channel 03/0 : 4[4] -> 3[3] via P2P/CUMEM/read
    2024-12-04T05:27:44.079Z
    algo-1:18061:18137 [4] NCCL INFO Channel 04/0 : 4[4] -> 3[3] via P2P/CUMEM/read
    2024-12-04T05:27:44.079Z
    algo-1:18058:18132 [1] NCCL INFO Channel 01/0 : 1[1] -> 0[0] via P2P/CUMEM/read
    2024-12-04T05:27:44.079Z
    algo-1:18061:18137 [4] NCCL INFO Channel 05/0 : 4[4] -> 3[3] via P2P/CUMEM/read
    2024-12-04T05:27:44.079Z
    algo-1:18058:18132 [1] NCCL INFO Channel 02/0 : 1[1] -> 0[0] via P2P/CUMEM/read
    2024-12-04T05:27:44.079Z
    algo-1:18061:18137 [4] NCCL INFO Channel 06/0 : 4[4] -> 3[3] via P2P/CUMEM/read
    2024-12-04T05:27:44.079Z
    algo-1:18058:18132 [1] NCCL INFO Channel 03/0 : 1[1] -> 0[0] via P2P/CUMEM/read
    2024-12-04T05:27:44.079Z
    algo-1:18061:18137 [4] NCCL INFO Channel 07/0 : 4[4] -> 3[3] via P2P/CUMEM/read
    2024-12-04T05:27:44.079Z
    algo-1:18058:18132 [1] NCCL INFO Channel 05/0 : 1[1] -> 0[0] via P2P/CUMEM/read
    2024-12-04T05:27:44.079Z
    algo-1:18058:18132 [1] NCCL INFO Channel 06/0 : 1[1] -> 0[0] via P2P/CUMEM/read
    2024-12-04T05:27:44.079Z
    algo-1:18058:18132 [1] NCCL INFO Channel 07/0 : 1[1] -> 0[0] via P2P/CUMEM/read
    2024-12-04T05:27:44.079Z
    algo-1:18057:18130 [0] NCCL INFO Channel 00/0 : 9[1] -> 0[0] [receive] via NET/AWS Libfabric/0/GDRDMA
    2024-12-04T05:27:44.079Z
    algo-1:18057:18152 [0] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:44.079Z
    algo-1:18057:18130 [0] NCCL INFO Channel 04/0 : 9[1] -> 0[0] [receive] via NET/AWS Libfabric/0/GDRDMA
    2024-12-04T05:27:44.079Z
    algo-1:18057:18130 [0] NCCL INFO Channel 00/0 : 0[0] -> 7[7] via P2P/CUMEM/read
    2024-12-04T05:27:44.079Z
    algo-1:18059:18136 [2] NCCL INFO Channel 01/0 : 11[3] -> 2[2] [receive] via NET/AWS Libfabric/1/GDRDMA
    2024-12-04T05:27:44.079Z
    algo-1:18059:18146 [2] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:44.079Z
    algo-1:18059:18136 [2] NCCL INFO Channel 05/0 : 11[3] -> 2[2] [receive] via NET/AWS Libfabric/1/GDRDMA
    2024-12-04T05:27:44.079Z
    libfabric:18062:1733290063::efa:ep_ctrl:efa_rdm_ep_set_shared_memory_permitted():1410 FI_OPT_SHARED_MEMORY_PERMITTED set to false
    2024-12-04T05:27:44.079Z
    algo-1:18057:18130 [0] NCCL INFO Channel 01/0 : 0[0] -> 7[7] via P2P/CUMEM/read
    2024-12-04T05:27:44.080Z
    algo-1:18057:18130 [0] NCCL INFO Channel 02/0 : 0[0] -> 7[7] via P2P/CUMEM/read
    2024-12-04T05:27:44.080Z
    algo-1:18057:18130 [0] NCCL INFO Channel 03/0 : 0[0] -> 7[7] via P2P/CUMEM/read
    2024-12-04T05:27:44.080Z
    algo-1:18057:18130 [0] NCCL INFO Channel 04/0 : 0[0] -> 7[7] via P2P/CUMEM/read
    2024-12-04T05:27:44.080Z
    algo-1:18057:18130 [0] NCCL INFO Channel 05/0 : 0[0] -> 7[7] via P2P/CUMEM/read
    2024-12-04T05:27:44.080Z
    algo-1:18057:18130 [0] NCCL INFO Channel 06/0 : 0[0] -> 7[7] via P2P/CUMEM/read
    2024-12-04T05:27:44.080Z
    algo-1:18057:18130 [0] NCCL INFO Channel 07/0 : 0[0] -> 7[7] via P2P/CUMEM/read
    2024-12-04T05:27:44.080Z
    algo-1:18059:18136 [2] NCCL INFO Channel 00/0 : 2[2] -> 1[1] via P2P/CUMEM/read
    2024-12-04T05:27:44.080Z
    algo-1:18064:18134 [7] NCCL INFO Channel 00/0 : 7[7] -> 6[6] via P2P/CUMEM/read
    2024-12-04T05:27:44.080Z
    algo-1:18059:18136 [2] NCCL INFO Channel 01/0 : 2[2] -> 1[1] via P2P/CUMEM/read
    2024-12-04T05:27:44.080Z
    algo-1:18064:18134 [7] NCCL INFO Channel 01/0 : 7[7] -> 6[6] via P2P/CUMEM/read
    2024-12-04T05:27:44.080Z
    algo-1:18059:18136 [2] NCCL INFO Channel 02/0 : 2[2] -> 1[1] via P2P/CUMEM/read
    2024-12-04T05:27:44.080Z
    algo-1:18064:18134 [7] NCCL INFO Channel 02/0 : 7[7] -> 6[6] via P2P/CUMEM/read
    2024-12-04T05:27:44.080Z
    algo-1:18064:18134 [7] NCCL INFO Channel 04/0 : 7[7] -> 6[6] via P2P/CUMEM/read
    2024-12-04T05:27:44.080Z
    algo-1:18059:18136 [2] NCCL INFO Channel 03/0 : 2[2] -> 1[1] via P2P/CUMEM/read
    2024-12-04T05:27:44.080Z
    algo-1:18064:18134 [7] NCCL INFO Channel 05/0 : 7[7] -> 6[6] via P2P/CUMEM/read
    2024-12-04T05:27:44.080Z
    algo-1:18059:18136 [2] NCCL INFO Channel 04/0 : 2[2] -> 1[1] via P2P/CUMEM/read
    2024-12-04T05:27:44.080Z
    algo-1:18064:18134 [7] NCCL INFO Channel 06/0 : 7[7] -> 6[6] via P2P/CUMEM/read
    2024-12-04T05:27:44.080Z
    algo-1:18059:18136 [2] NCCL INFO Channel 05/0 : 2[2] -> 1[1] via P2P/CUMEM/read
    2024-12-04T05:27:44.080Z
    algo-1:18059:18136 [2] NCCL INFO Channel 06/0 : 2[2] -> 1[1] via P2P/CUMEM/read
    2024-12-04T05:27:44.080Z
    algo-1:18059:18136 [2] NCCL INFO Channel 07/0 : 2[2] -> 1[1] via P2P/CUMEM/read
    2024-12-04T05:27:44.080Z
    libfabric:18058:1733290064::efa:ep_ctrl:efa_rdm_ep_set_shared_memory_permitted():1410 FI_OPT_SHARED_MEMORY_PERMITTED set to false
    2024-12-04T05:27:44.080Z
    libfabric:18060:1733290064::efa:ep_ctrl:efa_rdm_ep_set_shared_memory_permitted():1410 FI_OPT_SHARED_MEMORY_PERMITTED set to false
    2024-12-04T05:27:44.080Z
    libfabric:18064:1733290064::efa:ep_ctrl:efa_rdm_ep_set_shared_memory_permitted():1410 FI_OPT_SHARED_MEMORY_PERMITTED set to false
    2024-12-04T05:27:45.082Z
    algo-1:18062:18131 [5] NCCL INFO Connected all rings
    2024-12-04T05:27:45.082Z
    algo-1:18060:18135 [3] NCCL INFO Connected all rings
    2024-12-04T05:27:45.082Z
    algo-1:18061:18137 [4] NCCL INFO Connected all rings
    2024-12-04T05:27:45.082Z
    algo-1:18063:18133 [6] NCCL INFO Connected all rings
    2024-12-04T05:27:45.082Z
    algo-1:18064:18134 [7] NCCL INFO Connected all rings
    2024-12-04T05:27:45.082Z
    algo-1:18063:18133 [6] NCCL INFO Channel 00/0 : 6[6] -> 7[7] via P2P/CUMEM/read
    2024-12-04T05:27:45.082Z
    algo-1:18057:18130 [0] NCCL INFO Connected all rings
    2024-12-04T05:27:45.082Z
    algo-1:18061:18137 [4] NCCL INFO Channel 00/0 : 4[4] -> 5[5] via P2P/CUMEM/read
    2024-12-04T05:27:45.082Z
    algo-1:18057:18130 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/CUMEM/read
    2024-12-04T05:27:45.082Z
    algo-1:18060:18135 [3] NCCL INFO Channel 00/0 : 3[3] -> 4[4] via P2P/CUMEM/read
    2024-12-04T05:27:45.082Z
    algo-1:18058:18132 [1] NCCL INFO Connected all rings
    2024-12-04T05:27:45.082Z
    algo-1:18059:18136 [2] NCCL INFO Connected all rings
    2024-12-04T05:27:45.082Z
    algo-1:18063:18133 [6] NCCL INFO Channel 01/0 : 6[6] -> 7[7] via P2P/CUMEM/read
    2024-12-04T05:27:45.082Z
    algo-1:18060:18135 [3] NCCL INFO Channel 01/0 : 3[3] -> 4[4] via P2P/CUMEM/read
    2024-12-04T05:27:45.082Z
    algo-1:18061:18137 [4] NCCL INFO Channel 01/0 : 4[4] -> 5[5] via P2P/CUMEM/read
    2024-12-04T05:27:45.082Z
    algo-1:18057:18130 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/CUMEM/read
    2024-12-04T05:27:45.082Z
    algo-1:18063:18133 [6] NCCL INFO Channel 02/0 : 6[6] -> 7[7] via P2P/CUMEM/read
    2024-12-04T05:27:45.082Z
    algo-1:18062:18131 [5] NCCL INFO Channel 00/0 : 5[5] -> 6[6] via P2P/CUMEM/read
    2024-12-04T05:27:45.082Z
    algo-1:18060:18135 [3] NCCL INFO Channel 03/0 : 3[3] -> 4[4] via P2P/CUMEM/read
    2024-12-04T05:27:45.082Z
    algo-1:18061:18137 [4] NCCL INFO Channel 02/0 : 4[4] -> 5[5] via P2P/CUMEM/read
    2024-12-04T05:27:45.082Z
    algo-1:18057:18130 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[1] via P2P/CUMEM/read
    2024-12-04T05:27:45.082Z
    algo-1:18063:18133 [6] NCCL INFO Channel 03/0 : 6[6] -> 7[7] via P2P/CUMEM/read
    2024-12-04T05:27:45.082Z
    algo-1:18062:18131 [5] NCCL INFO Channel 01/0 : 5[5] -> 6[6] via P2P/CUMEM/read
    2024-12-04T05:27:45.082Z
    algo-1:18060:18135 [3] NCCL INFO Channel 04/0 : 3[3] -> 4[4] via P2P/CUMEM/read
    2024-12-04T05:27:45.082Z
    algo-1:18061:18137 [4] NCCL INFO Channel 03/0 : 4[4] -> 5[5] via P2P/CUMEM/read
    2024-12-04T05:27:45.082Z
    algo-1:18057:18130 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[1] via P2P/CUMEM/read
    2024-12-04T05:27:45.083Z
    algo-1:18063:18133 [6] NCCL INFO Channel 04/0 : 6[6] -> 7[7] via P2P/CUMEM/read
    2024-12-04T05:27:45.083Z
    algo-1:18062:18131 [5] NCCL INFO Channel 02/0 : 5[5] -> 6[6] via P2P/CUMEM/read
    2024-12-04T05:27:45.083Z
    algo-1:18060:18135 [3] NCCL INFO Channel 05/0 : 3[3] -> 4[4] via P2P/CUMEM/read
    2024-12-04T05:27:45.083Z
    algo-1:18057:18130 [0] NCCL INFO Channel 04/0 : 0[0] -> 1[1] via P2P/CUMEM/read
    2024-12-04T05:27:45.083Z
    algo-1:18061:18137 [4] NCCL INFO Channel 04/0 : 4[4] -> 5[5] via P2P/CUMEM/read
    2024-12-04T05:27:45.083Z
    algo-1:18063:18133 [6] NCCL INFO Channel 05/0 : 6[6] -> 7[7] via P2P/CUMEM/read
    2024-12-04T05:27:45.083Z
    algo-1:18062:18131 [5] NCCL INFO Channel 04/0 : 5[5] -> 6[6] via P2P/CUMEM/read
    2024-12-04T05:27:45.083Z
    algo-1:18060:18135 [3] NCCL INFO Channel 07/0 : 3[3] -> 4[4] via P2P/CUMEM/read
    2024-12-04T05:27:45.083Z
    algo-1:18057:18130 [0] NCCL INFO Channel 05/0 : 0[0] -> 1[1] via P2P/CUMEM/read
    2024-12-04T05:27:45.083Z
    algo-1:18061:18137 [4] NCCL INFO Channel 05/0 : 4[4] -> 5[5] via P2P/CUMEM/read
    2024-12-04T05:27:45.083Z
    algo-1:18063:18133 [6] NCCL INFO Channel 06/0 : 6[6] -> 7[7] via P2P/CUMEM/read
    2024-12-04T05:27:45.083Z
    algo-1:18062:18131 [5] NCCL INFO Channel 05/0 : 5[5] -> 6[6] via P2P/CUMEM/read
    2024-12-04T05:27:45.083Z
    algo-1:18057:18130 [0] NCCL INFO Channel 06/0 : 0[0] -> 1[1] via P2P/CUMEM/read
    2024-12-04T05:27:45.083Z
    algo-1:18059:18136 [2] NCCL INFO Channel 00/0 : 2[2] -> 3[3] via P2P/CUMEM/read
    2024-12-04T05:27:45.083Z
    algo-1:18061:18137 [4] NCCL INFO Channel 06/0 : 4[4] -> 5[5] via P2P/CUMEM/read
    2024-12-04T05:27:45.083Z
    algo-1:18063:18133 [6] NCCL INFO Channel 07/0 : 6[6] -> 7[7] via P2P/CUMEM/read
    2024-12-04T05:27:45.083Z
    algo-1:18062:18131 [5] NCCL INFO Channel 06/0 : 5[5] -> 6[6] via P2P/CUMEM/read
    2024-12-04T05:27:45.083Z
    algo-1:18057:18130 [0] NCCL INFO Channel 07/0 : 0[0] -> 1[1] via P2P/CUMEM/read
    2024-12-04T05:27:45.083Z
    algo-1:18059:18136 [2] NCCL INFO Channel 01/0 : 2[2] -> 3[3] via P2P/CUMEM/read
    2024-12-04T05:27:45.083Z
    algo-1:18058:18132 [1] NCCL INFO Channel 00/0 : 1[1] -> 2[2] via P2P/CUMEM/read
    2024-12-04T05:27:45.083Z
    algo-1:18061:18137 [4] NCCL INFO Channel 07/0 : 4[4] -> 5[5] via P2P/CUMEM/read
    2024-12-04T05:27:45.083Z
    algo-1:18064:18134 [7] NCCL INFO Channel 01/0 : 7[7] -> 0[0] via P2P/CUMEM/read
    2024-12-04T05:27:45.083Z
    algo-1:18059:18136 [2] NCCL INFO Channel 02/0 : 2[2] -> 3[3] via P2P/CUMEM/read
    2024-12-04T05:27:45.083Z
    algo-1:18063:18138 [6] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:45.083Z
    algo-1:18063:18133 [6] NCCL INFO Channel 03/0 : 14[6] -> 6[6] [receive] via NET/AWS Libfabric/3/GDRDMA
    2024-12-04T05:27:45.083Z
    algo-1:18058:18132 [1] NCCL INFO Channel 02/0 : 1[1] -> 2[2] via P2P/CUMEM/read
    2024-12-04T05:27:45.083Z
    algo-1:18063:18138 [6] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:45.083Z
    algo-1:18063:18133 [6] NCCL INFO Channel 07/0 : 14[6] -> 6[6] [receive] via NET/AWS Libfabric/3/GDRDMA
    2024-12-04T05:27:45.083Z
    algo-1:18063:18138 [6] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:45.083Z
    algo-1:18063:18133 [6] NCCL INFO Channel 03/0 : 6[6] -> 14[6] [send] via NET/AWS Libfabric/3/GDRDMA
    2024-12-04T05:27:45.083Z
    algo-1:18063:18138 [6] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:45.083Z
    algo-1:18064:18134 [7] NCCL INFO Channel 02/0 : 7[7] -> 0[0] via P2P/CUMEM/read
    2024-12-04T05:27:45.083Z
    algo-1:18063:18133 [6] NCCL INFO Channel 07/0 : 6[6] -> 14[6] [send] via NET/AWS Libfabric/3/GDRDMA
    2024-12-04T05:27:45.083Z
    algo-1:18059:18136 [2] NCCL INFO Channel 03/0 : 2[2] -> 3[3] via P2P/CUMEM/read
    2024-12-04T05:27:45.083Z
    algo-1:18058:18132 [1] NCCL INFO Channel 03/0 : 1[1] -> 2[2] via P2P/CUMEM/read
    2024-12-04T05:27:45.083Z
    algo-1:18062:18131 [5] NCCL INFO Channel 02/0 : 5[5] -> 4[4] via P2P/CUMEM/read
    2024-12-04T05:27:45.083Z
    algo-1:18064:18134 [7] NCCL INFO Channel 03/0 : 7[7] -> 0[0] via P2P/CUMEM/read
    2024-12-04T05:27:45.083Z
    algo-1:18061:18141 [4] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:45.083Z
    algo-1:18059:18136 [2] NCCL INFO Channel 04/0 : 2[2] -> 3[3] via P2P/CUMEM/read
    2024-12-04T05:27:45.083Z
    algo-1:18061:18137 [4] NCCL INFO Channel 02/0 : 12[4] -> 4[4] [receive] via NET/AWS Libfabric/2/GDRDMA
    2024-12-04T05:27:45.083Z
    algo-1:18058:18132 [1] NCCL INFO Channel 04/0 : 1[1] -> 2[2] via P2P/CUMEM/read
    2024-12-04T05:27:45.083Z
    algo-1:18061:18141 [4] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:45.083Z
    algo-1:18062:18131 [5] NCCL INFO Channel 06/0 : 5[5] -> 4[4] via P2P/CUMEM/read
    2024-12-04T05:27:45.083Z
    algo-1:18061:18137 [4] NCCL INFO Channel 06/0 : 12[4] -> 4[4] [receive] via NET/AWS Libfabric/2/GDRDMA
    2024-12-04T05:27:45.083Z
    algo-1:18064:18134 [7] NCCL INFO Channel 05/0 : 7[7] -> 0[0] via P2P/CUMEM/read
    2024-12-04T05:27:45.083Z
    algo-1:18061:18141 [4] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:45.083Z
    algo-1:18061:18137 [4] NCCL INFO Channel 02/0 : 4[4] -> 12[4] [send] via NET/AWS Libfabric/2/GDRDMA
    2024-12-04T05:27:45.084Z
    algo-1:18061:18141 [4] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:45.084Z
    algo-1:18061:18137 [4] NCCL INFO Channel 06/0 : 4[4] -> 12[4] [send] via NET/AWS Libfabric/2/GDRDMA
    2024-12-04T05:27:45.084Z
    algo-1:18064:18134 [7] NCCL INFO Channel 06/0 : 7[7] -> 0[0] via P2P/CUMEM/read
    2024-12-04T05:27:45.084Z
    algo-1:18064:18134 [7] NCCL INFO Channel 07/0 : 7[7] -> 0[0] via P2P/CUMEM/read
    2024-12-04T05:27:45.084Z
    algo-1:18059:18136 [2] NCCL INFO Channel 05/0 : 2[2] -> 3[3] via P2P/CUMEM/read
    2024-12-04T05:27:45.084Z
    algo-1:18059:18136 [2] NCCL INFO Channel 06/0 : 2[2] -> 3[3] via P2P/CUMEM/read
    2024-12-04T05:27:45.084Z
    algo-1:18059:18136 [2] NCCL INFO Channel 07/0 : 2[2] -> 3[3] via P2P/CUMEM/read
    2024-12-04T05:27:45.084Z
    algo-1:18058:18132 [1] NCCL INFO Channel 06/0 : 1[1] -> 2[2] via P2P/CUMEM/read
    2024-12-04T05:27:45.084Z
    algo-1:18058:18132 [1] NCCL INFO Channel 07/0 : 1[1] -> 2[2] via P2P/CUMEM/read
    2024-12-04T05:27:45.084Z
    algo-1:18060:18135 [3] NCCL INFO Channel 01/0 : 3[3] -> 2[2] via P2P/CUMEM/read
    2024-12-04T05:27:45.084Z
    algo-1:18060:18135 [3] NCCL INFO Channel 05/0 : 3[3] -> 2[2] via P2P/CUMEM/read
    2024-12-04T05:27:45.084Z
    algo-1:18057:18152 [0] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:45.084Z
    algo-1:18058:18132 [1] NCCL INFO Channel 00/0 : 1[1] -> 0[0] via P2P/CUMEM/read
    2024-12-04T05:27:45.084Z
    algo-1:18059:18146 [2] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:45.084Z
    algo-1:18057:18130 [0] NCCL INFO Channel 00/0 : 8[0] -> 0[0] [receive] via NET/AWS Libfabric/0/GDRDMA
    2024-12-04T05:27:45.084Z
    algo-1:18059:18136 [2] NCCL INFO Channel 01/0 : 10[2] -> 2[2] [receive] via NET/AWS Libfabric/1/GDRDMA
    2024-12-04T05:27:45.084Z
    algo-1:18057:18152 [0] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:45.084Z
    algo-1:18059:18146 [2] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:45.084Z
    algo-1:18057:18130 [0] NCCL INFO Channel 04/0 : 8[0] -> 0[0] [receive] via NET/AWS Libfabric/0/GDRDMA
    2024-12-04T05:27:45.084Z
    algo-1:18059:18136 [2] NCCL INFO Channel 05/0 : 10[2] -> 2[2] [receive] via NET/AWS Libfabric/1/GDRDMA
    2024-12-04T05:27:45.084Z
    algo-1:18057:18152 [0] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:45.084Z
    algo-1:18059:18146 [2] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:45.084Z
    algo-1:18057:18130 [0] NCCL INFO Channel 00/0 : 0[0] -> 8[0] [send] via NET/AWS Libfabric/0/GDRDMA
    2024-12-04T05:27:45.084Z
    algo-1:18059:18136 [2] NCCL INFO Channel 01/0 : 2[2] -> 10[2] [send] via NET/AWS Libfabric/1/GDRDMA
    2024-12-04T05:27:45.084Z
    algo-1:18057:18152 [0] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:45.084Z
    algo-1:18059:18146 [2] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:45.084Z
    algo-1:18058:18132 [1] NCCL INFO Channel 04/0 : 1[1] -> 0[0] via P2P/CUMEM/read
    2024-12-04T05:27:45.084Z
    algo-1:18057:18130 [0] NCCL INFO Channel 04/0 : 0[0] -> 8[0] [send] via NET/AWS Libfabric/0/GDRDMA
    2024-12-04T05:27:45.084Z
    algo-1:18059:18136 [2] NCCL INFO Channel 05/0 : 2[2] -> 10[2] [send] via NET/AWS Libfabric/1/GDRDMA
    2024-12-04T05:27:45.084Z
    algo-1:18064:18134 [7] NCCL INFO Channel 03/0 : 7[7] -> 6[6] via P2P/CUMEM/read
    2024-12-04T05:27:45.084Z
    algo-1:18064:18134 [7] NCCL INFO Channel 07/0 : 7[7] -> 6[6] via P2P/CUMEM/read
    2024-12-04T05:27:45.084Z
    algo-1:18060:18135 [3] NCCL INFO Connected all trees
    2024-12-04T05:27:45.084Z
    algo-1:18060:18135 [3] NCCL INFO NCCL_PROTO set by environment to simple
    2024-12-04T05:27:45.084Z
    algo-1:18059:18136 [2] NCCL INFO Connected all trees
    2024-12-04T05:27:45.084Z
    algo-1:18060:18135 [3] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
    2024-12-04T05:27:45.084Z
    algo-1:18059:18136 [2] NCCL INFO NCCL_PROTO set by environment to simple
    2024-12-04T05:27:45.084Z
    algo-1:18060:18135 [3] NCCL INFO 8 coll channels, 8 collnet channels, 0 nvls channels, 8 p2p channels, 2 p2p channels per peer
    2024-12-04T05:27:45.084Z
    algo-1:18059:18136 [2] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
    2024-12-04T05:27:45.084Z
    algo-1:18059:18136 [2] NCCL INFO 8 coll channels, 8 collnet channels, 0 nvls channels, 8 p2p channels, 2 p2p channels per peer
    2024-12-04T05:27:45.084Z
    algo-1:18057:18130 [0] NCCL INFO Connected all trees
    2024-12-04T05:27:45.084Z
    algo-1:18058:18132 [1] NCCL INFO Connected all trees
    2024-12-04T05:27:45.084Z
    algo-1:18057:18130 [0] NCCL INFO NCCL_PROTO set by environment to simple
    2024-12-04T05:27:45.084Z
    algo-1:18058:18132 [1] NCCL INFO NCCL_PROTO set by environment to simple
    2024-12-04T05:27:45.084Z
    algo-1:18058:18132 [1] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
    2024-12-04T05:27:45.084Z
    algo-1:18058:18132 [1] NCCL INFO 8 coll channels, 8 collnet channels, 0 nvls channels, 8 p2p channels, 2 p2p channels per peer
    2024-12-04T05:27:45.084Z
    algo-1:18057:18130 [0] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
    2024-12-04T05:27:45.084Z
    algo-1:18057:18130 [0] NCCL INFO 8 coll channels, 8 collnet channels, 0 nvls channels, 8 p2p channels, 2 p2p channels per peer
    2024-12-04T05:27:45.084Z
    algo-1:18064:18134 [7] NCCL INFO Connected all trees
    2024-12-04T05:27:45.084Z
    algo-1:18063:18133 [6] NCCL INFO Connected all trees
    2024-12-04T05:27:45.085Z
    algo-1:18064:18134 [7] NCCL INFO NCCL_PROTO set by environment to simple
    2024-12-04T05:27:45.085Z
    algo-1:18064:18134 [7] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
    2024-12-04T05:27:45.085Z
    algo-1:18063:18133 [6] NCCL INFO NCCL_PROTO set by environment to simple
    2024-12-04T05:27:45.085Z
    algo-1:18064:18134 [7] NCCL INFO 8 coll channels, 8 collnet channels, 0 nvls channels, 8 p2p channels, 2 p2p channels per peer
    2024-12-04T05:27:45.085Z
    algo-1:18063:18133 [6] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
    2024-12-04T05:27:45.085Z
    algo-1:18063:18133 [6] NCCL INFO 8 coll channels, 8 collnet channels, 0 nvls channels, 8 p2p channels, 2 p2p channels per peer
    2024-12-04T05:27:45.085Z
    algo-1:18062:18131 [5] NCCL INFO Connected all trees
    2024-12-04T05:27:45.085Z
    algo-1:18061:18137 [4] NCCL INFO Connected all trees
    2024-12-04T05:27:45.085Z
    algo-1:18062:18131 [5] NCCL INFO NCCL_PROTO set by environment to simple
    2024-12-04T05:27:45.085Z
    algo-1:18061:18137 [4] NCCL INFO NCCL_PROTO set by environment to simple
    2024-12-04T05:27:45.085Z
    algo-1:18062:18131 [5] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
    2024-12-04T05:27:45.085Z
    algo-1:18061:18137 [4] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
    2024-12-04T05:27:45.085Z
    algo-1:18062:18131 [5] NCCL INFO 8 coll channels, 8 collnet channels, 0 nvls channels, 8 p2p channels, 2 p2p channels per peer
    2024-12-04T05:27:45.085Z
    algo-1:18061:18137 [4] NCCL INFO 8 coll channels, 8 collnet channels, 0 nvls channels, 8 p2p channels, 2 p2p channels per peer
    2024-12-04T05:27:45.085Z
    algo-1:18060:18135 [3] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v2, using internal tuner instead.
    2024-12-04T05:27:45.085Z
    algo-1:18057:18130 [0] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v2, using internal tuner instead.
    2024-12-04T05:27:45.085Z
    algo-1:18059:18136 [2] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v2, using internal tuner instead.
    2024-12-04T05:27:45.085Z
    algo-1:18058:18132 [1] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v2, using internal tuner instead.
    2024-12-04T05:27:45.085Z
    algo-1:18058:18132 [1] NCCL INFO ncclCommInitRank comm 0x564d4e132270 rank 1 nranks 16 cudaDev 1 nvmlDev 1 busId 101d0 commId 0x9c65e97761265de2 - Init COMPLETE
    2024-12-04T05:27:45.085Z
    algo-1:18060:18135 [3] NCCL INFO ncclCommInitRank comm 0x5640eee77090 rank 3 nranks 16 cudaDev 3 nvmlDev 3 busId 201d0 commId 0x9c65e97761265de2 - Init COMPLETE
    2024-12-04T05:27:45.085Z
    algo-1:18059:18136 [2] NCCL INFO ncclCommInitRank comm 0x55e5bd726dc0 rank 2 nranks 16 cudaDev 2 nvmlDev 2 busId 201c0 commId 0x9c65e97761265de2 - Init COMPLETE
    2024-12-04T05:27:45.085Z
    algo-1:18057:18130 [0] NCCL INFO ncclCommInitRank comm 0x56129032bf60 rank 0 nranks 16 cudaDev 0 nvmlDev 0 busId 101c0 commId 0x9c65e97761265de2 - Init COMPLETE
    2024-12-04T05:27:45.085Z
    algo-1:18062:18131 [5] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v2, using internal tuner instead.
    2024-12-04T05:27:45.085Z
    algo-1:18064:18134 [7] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v2, using internal tuner instead.
    2024-12-04T05:27:45.085Z
    algo-1:18063:18133 [6] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v2, using internal tuner instead.
    2024-12-04T05:27:45.085Z
    algo-1:18061:18137 [4] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v2, using internal tuner instead.
    2024-12-04T05:27:45.085Z
    algo-1:18062:18131 [5] NCCL INFO ncclCommInitRank comm 0x56220f3a29f0 rank 5 nranks 16 cudaDev 5 nvmlDev 5 busId 901d0 commId 0x9c65e97761265de2 - Init COMPLETE
    2024-12-04T05:27:45.085Z
    algo-1:18063:18133 [6] NCCL INFO ncclCommInitRank comm 0x55c93517fce0 rank 6 nranks 16 cudaDev 6 nvmlDev 6 busId a01c0 commId 0x9c65e97761265de2 - Init COMPLETE
    2024-12-04T05:27:45.085Z
    algo-1:18064:18134 [7] NCCL INFO ncclCommInitRank comm 0x564a47b7b3c0 rank 7 nranks 16 cudaDev 7 nvmlDev 7 busId a01d0 commId 0x9c65e97761265de2 - Init COMPLETE
    2024-12-04T05:27:45.085Z
    algo-1:18061:18137 [4] NCCL INFO ncclCommInitRank comm 0x562b5cc5cba0 rank 4 nranks 16 cudaDev 4 nvmlDev 4 busId 901c0 commId 0x9c65e97761265de2 - Init COMPLETE
    2024-12-04T05:27:45.085Z
    /opt/conda/lib/python3.11/site-packages/torchrec/optim/apply_optimizer_in_backward.py:49: DeprecationWarning: TorchScript support for functional optimizers is deprecated and will be removed in a future PyTorch release. Consider using the torch.compile optimizer instead. from torch.distributed.optim import _apply_optimizer_in_backward
    2024-12-04T05:27:45.085Z
    /opt/conda/lib/python3.11/site-packages/torchrec/optim/apply_optimizer_in_backward.py:49: DeprecationWarning: TorchScript support for functional optimizers is deprecated and will be removed in a future PyTorch release. Consider using the torch.compile optimizer instead. from torch.distributed.optim import _apply_optimizer_in_backward
    2024-12-04T05:27:45.085Z
    /opt/conda/lib/python3.11/site-packages/torchrec/optim/apply_optimizer_in_backward.py:49: DeprecationWarning: TorchScript support for functional optimizers is deprecated and will be removed in a future PyTorch release. Consider using the torch.compile optimizer instead. from torch.distributed.optim import _apply_optimizer_in_backward
    2024-12-04T05:27:45.085Z
    /opt/conda/lib/python3.11/site-packages/torchrec/optim/apply_optimizer_in_backward.py:49: DeprecationWarning: TorchScript support for functional optimizers is deprecated and will be removed in a future PyTorch release. Consider using the torch.compile optimizer instead. from torch.distributed.optim import _apply_optimizer_in_backward
    2024-12-04T05:27:45.085Z
    /opt/conda/lib/python3.11/site-packages/torchrec/optim/apply_optimizer_in_backward.py:49: DeprecationWarning: TorchScript support for functional optimizers is deprecated and will be removed in a future PyTorch release. Consider using the torch.compile optimizer instead. from torch.distributed.optim import _apply_optimizer_in_backward
    2024-12-04T05:27:45.085Z
    /opt/conda/lib/python3.11/site-packages/torchrec/optim/apply_optimizer_in_backward.py:49: DeprecationWarning: TorchScript support for functional optimizers is deprecated and will be removed in a future PyTorch release. Consider using the torch.compile optimizer instead. from torch.distributed.optim import _apply_optimizer_in_backward
    2024-12-04T05:27:45.085Z
    /opt/conda/lib/python3.11/site-packages/torchrec/optim/apply_optimizer_in_backward.py:49: DeprecationWarning: TorchScript support for functional optimizers is deprecated and will be removed in a future PyTorch release. Consider using the torch.compile optimizer instead. from torch.distributed.optim import _apply_optimizer_in_backward
    2024-12-04T05:27:45.085Z
    /opt/conda/lib/python3.11/site-packages/torchrec/optim/apply_optimizer_in_backward.py:49: DeprecationWarning: TorchScript support for functional optimizers is deprecated and will be removed in a future PyTorch release. Consider using the torch.compile optimizer instead. from torch.distributed.optim import _apply_optimizer_in_backward
    2024-12-04T05:27:45.085Z
    model:
    2024-12-04T05:27:45.121Z
    DistributedModelParallel( (_dmp_wrapped_module): DistributedDataParallel( (module): DNNModelMainTrain( (model): DNNModelMainTorchRec( (sparse_arch): SparseArch( (embedding_bag_collection): ShardedEmbeddingBagCollection( (lookups): GroupedPooledEmbeddingsLookup( (_emb_modules): ModuleList( (0): BatchedFusedEmbeddingBag( (_emb_module): SplitTableBatchedEmbeddingBagsCodegen() ) ) ) (_output_dists): RwPooledEmbeddingDist() (embedding_bags): ModuleDict( (shared_embedding): Module() ) ) ) (_gateEmbedding): GateEmbedding( (layer1): Linear(in_features=5168, out_features=323, bias=True) (act2): Sigmoid() ) (_h1): Linear(in_features=5168, out_features=1024, bias=True) (_h2): FourChannelHidden( (wc2): Linear(in_features=256, out_features=256, bias=True) (wc3): Linear(in_features=1024, out_features=256, bias=True) (w): Linear(in_features=1795, out_features=512, bias=True) (act1): Tanh() (act): ReLU() ) (_h3): FourChannelHidden( (wc2): Linear(in_features=128, out_features=128, bias=True) (wc3): Linear(in_features=512, out_features=128, bias=True) (w): Linear(in_features=899, out_features=256, bias=True) (act1): Tanh() (act): ReLU() ) (_h4): Linear(in_features=256, out_features=1, bias=True) (_bn): BatchNorm1d(5168, eps=1e-05, momentum=0.01, affine=True, track_running_stats=True) (act0): Sigmoid() ) ) )
    2024-12-04T05:27:45.122Z
    )
    2024-12-04T05:27:45.122Z
    model.sparse_arch.embedding_bag_collection
    2024-12-04T05:27:45.122Z
    shared_embedding
    2024-12-04T05:27:45.122Z
    ParameterSharding(sharding_type='row_wise', compute_kernel='fused', ranks=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15], sharding_spec=EnumerableShardingSpec(shards=[ShardMetadata(shard_offsets=[0, 0], shard_sizes=[62500000, 16], placement=rank:0/cuda:0), ShardMetadata(shard_offsets=[62500000, 0], shard_sizes=[62500000, 16], placement=rank:1/cuda:1), ShardMetadata(shard_offsets=[125000000, 0], shard_sizes=[62500000, 16], placement=rank:2/cuda:2), ShardMetadata(shard_offsets=[187500000, 0], shard_sizes=[62500000, 16], placement=rank:3/cuda:3), ShardMetadata(shard_offsets=[250000000, 0], shard_sizes=[62500000, 16], placement=rank:4/cuda:4), ShardMetadata(shard_offsets=[312500000, 0], shard_sizes=[62500000, 16], placement=rank:5/cuda:5), ShardMetadata(shard_offsets=[375000000, 0], shard_sizes=[62500000, 16], placement=rank:6/cuda:6), ShardMetadata(shard_offsets=[437500000, 0], shard_sizes=[62500000, 16], placement=rank:7/cuda:7), ShardMetadata(shard_offsets=[500000000, 0], shard_sizes=[62500000, 16], placement=rank:8/cuda:0), ShardMetadata(shard_offsets=[562500000, 0], shard_sizes=[62500000, 16], placement=rank:9/cuda:1), ShardMetadata(shard_offsets=[625000000, 0], shard_sizes=[62500000, 16], placement=rank:10/cuda:2), ShardMetadata(shard_offsets=[687500000, 0], shard_sizes=[62500000, 16], placement=rank:11/cuda:3), ShardMetadata(shard_offsets=[750000000, 0], shard_sizes=[62500000, 16], placement=rank:12/cuda:4), ShardMetadata(shard_offsets=[812500000, 0], shard_sizes=[62500000, 16], placement=rank:13/cuda:5), ShardMetadata(shard_offsets=[875000000, 0], shard_sizes=[62500000, 16], placement=rank:14/cuda:6), ShardMetadata(shard_offsets=[937500000, 0], shard_sizes=[62500000, 16], placement=rank:15/cuda:7)]), cache_params=None, enforce_hbm=None, stochastic_rounding=None, bounds_check_mode=None, output_dtype=None, key_value_params=None)
    2024-12-04T05:27:46.122Z
    Epoch 0: 0%| | 0/2820 [00:00<?, ?it/s]
    2024-12-04T05:27:47.122Z
    algo-1:18058:20779 [1] NCCL INFO Channel 00/1 : 1[1] -> 2[2] via P2P/CUMEM/read
    2024-12-04T05:27:47.122Z
    algo-1:18058:20779 [1] NCCL INFO Channel 01/1 : 1[1] -> 2[2] via P2P/CUMEM/read
    2024-12-04T05:27:47.122Z
    algo-1:18057:18152 [0] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:47.123Z
    algo-1:18064:20780 [7] NCCL INFO Channel 04/1 : 7[7] -> 8[0] [send] via NET/AWS Libfabric/0(0)/GDRDMA/Shared
    2024-12-04T05:27:47.123Z
    algo-1:18057:18152 [0] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:47.123Z
    algo-1:18064:20780 [7] NCCL INFO Channel 05/1 : 7[7] -> 8[0] [send] via NET/AWS Libfabric/0(0)/GDRDMA/Shared
    2024-12-04T05:27:47.123Z
    algo-1:18057:18152 [0] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:47.123Z
    algo-1:18057:20781 [0] NCCL INFO Channel 04/1 : 15[7] -> 0[0] [receive] via NET/AWS Libfabric/0/GDRDMA/Shared
    2024-12-04T05:27:47.123Z
    algo-1:18057:18152 [0] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:47.123Z
    algo-1:18057:20781 [0] NCCL INFO Channel 05/1 : 15[7] -> 0[0] [receive] via NET/AWS Libfabric/0/GDRDMA/Shared
    2024-12-04T05:27:47.123Z
    algo-1:18057:20781 [0] NCCL INFO Channel 00/1 : 0[0] -> 1[1] via P2P/CUMEM/read
    2024-12-04T05:27:47.123Z
    algo-1:18057:20781 [0] NCCL INFO Channel 01/1 : 0[0] -> 1[1] via P2P/CUMEM/read
    2024-12-04T05:27:47.123Z
    algo-1:18057:18152 [0] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:47.123Z
    algo-1:18057:20781 [0] NCCL INFO Channel 04/1 : 14[6] -> 0[0] [receive] via NET/AWS Libfabric/0/GDRDMA/Shared
    2024-12-04T05:27:47.123Z
    algo-1:18057:18152 [0] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:47.123Z
    algo-1:18057:20781 [0] NCCL INFO Channel 05/1 : 14[6] -> 0[0] [receive] via NET/AWS Libfabric/0/GDRDMA/Shared
    2024-12-04T05:27:47.123Z
    algo-1:18057:20781 [0] NCCL INFO Channel 00/1 : 0[0] -> 2[2] via P2P/CUMEM/read
    2024-12-04T05:27:47.123Z
    algo-1:18057:20781 [0] NCCL INFO Channel 01/1 : 0[0] -> 2[2] via P2P/CUMEM/read
    2024-12-04T05:27:47.123Z
    algo-1:18059:20782 [2] NCCL INFO Channel 00/1 : 2[2] -> 3[3] via P2P/CUMEM/read
    2024-12-04T05:27:47.123Z
    algo-1:18059:20782 [2] NCCL INFO Channel 01/1 : 2[2] -> 3[3] via P2P/CUMEM/read
    2024-12-04T05:27:47.123Z
    algo-1:18058:18145 [1] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:47.123Z
    algo-1:18058:20779 [1] NCCL INFO Channel 04/1 : 15[7] -> 1[1] [receive] via NET/AWS Libfabric/0/GDRDMA/Shared
    2024-12-04T05:27:47.123Z
    algo-1:18058:18145 [1] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:47.123Z
    algo-1:18058:20779 [1] NCCL INFO Channel 05/1 : 15[7] -> 1[1] [receive] via NET/AWS Libfabric/0/GDRDMA/Shared
    2024-12-04T05:27:47.123Z
    algo-1:18058:20779 [1] NCCL INFO Channel 00/1 : 1[1] -> 3[3] via P2P/CUMEM/read
    2024-12-04T05:27:47.123Z
    algo-1:18058:20779 [1] NCCL INFO Channel 01/1 : 1[1] -> 3[3] via P2P/CUMEM/read
    2024-12-04T05:27:47.123Z
    algo-1:18063:20783 [6] NCCL INFO Channel 00/1 : 6[6] -> 7[7] via P2P/CUMEM/read
    2024-12-04T05:27:47.123Z
    algo-1:18063:20783 [6] NCCL INFO Channel 01/1 : 6[6] -> 7[7] via P2P/CUMEM/read
    2024-12-04T05:27:47.123Z
    algo-1:18061:20784 [4] NCCL INFO Channel 00/1 : 4[4] -> 5[5] via P2P/CUMEM/read
    2024-12-04T05:27:47.123Z
    algo-1:18061:20784 [4] NCCL INFO Channel 01/1 : 4[4] -> 5[5] via P2P/CUMEM/read
    2024-12-04T05:27:47.123Z
    algo-1:18057:18152 [0] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:47.123Z
    algo-1:18064:20780 [7] NCCL INFO Channel 04/1 : 7[7] -> 9[1] [send] via NET/AWS Libfabric/0(0)/GDRDMA/Shared
    2024-12-04T05:27:47.123Z
    algo-1:18057:18152 [0] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:47.123Z
    algo-1:18064:20780 [7] NCCL INFO Channel 05/1 : 7[7] -> 9[1] [send] via NET/AWS Libfabric/0(0)/GDRDMA/Shared
    2024-12-04T05:27:47.123Z
    algo-1:18062:20785 [5] NCCL INFO Channel 00/1 : 5[5] -> 6[6] via P2P/CUMEM/read
    2024-12-04T05:27:47.123Z
    algo-1:18062:20785 [5] NCCL INFO Channel 01/1 : 5[5] -> 6[6] via P2P/CUMEM/read
    2024-12-04T05:27:47.123Z
    algo-1:18062:20785 [5] NCCL INFO Channel 00/1 : 5[5] -> 7[7] via P2P/CUMEM/read
    2024-12-04T05:27:47.123Z
    algo-1:18062:20785 [5] NCCL INFO Channel 01/1 : 5[5] -> 7[7] via P2P/CUMEM/read
    2024-12-04T05:27:47.123Z
    algo-1:18057:18152 [0] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:47.124Z
    algo-1:18063:20783 [6] NCCL INFO Channel 04/1 : 6[6] -> 8[0] [send] via NET/AWS Libfabric/0(0)/GDRDMA/Shared
    2024-12-04T05:27:47.124Z
    algo-1:18057:18152 [0] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:47.124Z
    algo-1:18063:20783 [6] NCCL INFO Channel 05/1 : 6[6] -> 8[0] [send] via NET/AWS Libfabric/0(0)/GDRDMA/Shared
    2024-12-04T05:27:47.124Z
    algo-1:18059:18146 [2] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:47.124Z
    algo-1:18064:20780 [7] NCCL INFO Channel 04/1 : 7[7] -> 10[2] [send] via NET/AWS Libfabric/1(2)/GDRDMA/Shared
    2024-12-04T05:27:47.124Z
    algo-1:18059:18146 [2] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:47.124Z
    algo-1:18064:20780 [7] NCCL INFO Channel 05/1 : 7[7] -> 10[2] [send] via NET/AWS Libfabric/1(2)/GDRDMA/Shared
    2024-12-04T05:27:48.124Z
    algo-1:18060:20786 [3] NCCL INFO Channel 00/1 : 3[3] -> 4[4] via P2P/CUMEM/read
    2024-12-04T05:27:48.124Z
    algo-1:18060:20786 [3] NCCL INFO Channel 01/1 : 3[3] -> 4[4] via P2P/CUMEM/read
    2024-12-04T05:27:48.124Z
    algo-1:18060:20786 [3] NCCL INFO Channel 00/1 : 3[3] -> 5[5] via P2P/CUMEM/read
    2024-12-04T05:27:48.124Z
    algo-1:18060:20786 [3] NCCL INFO Channel 01/1 : 3[3] -> 5[5] via P2P/CUMEM/read
    2024-12-04T05:27:48.124Z
    algo-1:18060:20786 [3] NCCL INFO Channel 00/1 : 3[3] -> 6[6] via P2P/CUMEM/read
    2024-12-04T05:27:48.124Z
    algo-1:18060:20786 [3] NCCL INFO Channel 01/1 : 3[3] -> 6[6] via P2P/CUMEM/read
    2024-12-04T05:27:48.124Z
    algo-1:18057:18152 [0] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:48.124Z
    algo-1:18062:20785 [5] NCCL INFO Channel 04/1 : 5[5] -> 8[0] [send] via NET/AWS Libfabric/0(0)/GDRDMA/Shared
    2024-12-04T05:27:48.124Z
    algo-1:18057:18152 [0] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:48.124Z
    algo-1:18062:20785 [5] NCCL INFO Channel 05/1 : 5[5] -> 8[0] [send] via NET/AWS Libfabric/0(0)/GDRDMA/Shared
    2024-12-04T05:27:48.124Z
    algo-1:18059:20782 [2] NCCL INFO Channel 00/1 : 2[2] -> 4[4] via P2P/CUMEM/read
    2024-12-04T05:27:48.124Z
    algo-1:18061:20784 [4] NCCL INFO Channel 00/1 : 4[4] -> 6[6] via P2P/CUMEM/read
    2024-12-04T05:27:48.124Z
    algo-1:18059:20782 [2] NCCL INFO Channel 01/1 : 2[2] -> 4[4] via P2P/CUMEM/read
    2024-12-04T05:27:48.124Z
    algo-1:18061:20784 [4] NCCL INFO Channel 01/1 : 4[4] -> 6[6] via P2P/CUMEM/read
    2024-12-04T05:27:48.125Z
    algo-1:18059:18146 [2] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:48.125Z
    algo-1:18059:20782 [2] NCCL INFO Channel 04/1 : 15[7] -> 2[2] [receive] via NET/AWS Libfabric/1/GDRDMA/Shared
    2024-12-04T05:27:48.125Z
    algo-1:18059:18146 [2] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:48.125Z
    algo-1:18059:20782 [2] NCCL INFO Channel 05/1 : 15[7] -> 2[2] [receive] via NET/AWS Libfabric/1/GDRDMA/Shared
    2024-12-04T05:27:48.125Z
    algo-1:18059:20782 [2] NCCL INFO Channel 00/1 : 2[2] -> 5[5] via P2P/CUMEM/read
    2024-12-04T05:27:48.125Z
    algo-1:18059:20782 [2] NCCL INFO Channel 01/1 : 2[2] -> 5[5] via P2P/CUMEM/read
    2024-12-04T05:27:48.125Z
    algo-1:18061:20784 [4] NCCL INFO Channel 00/1 : 4[4] -> 7[7] via P2P/CUMEM/read
    2024-12-04T05:27:48.125Z
    algo-1:18061:20784 [4] NCCL INFO Channel 01/1 : 4[4] -> 7[7] via P2P/CUMEM/read
    2024-12-04T05:27:48.125Z
    algo-1:18058:18145 [1] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:48.125Z
    algo-1:18058:20779 [1] NCCL INFO Channel 04/1 : 14[6] -> 1[1] [receive] via NET/AWS Libfabric/0/GDRDMA/Shared
    2024-12-04T05:27:48.125Z
    algo-1:18058:18145 [1] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:48.125Z
    algo-1:18058:20779 [1] NCCL INFO Channel 05/1 : 14[6] -> 1[1] [receive] via NET/AWS Libfabric/0/GDRDMA/Shared
    2024-12-04T05:27:48.125Z
    algo-1:18058:20779 [1] NCCL INFO Channel 00/1 : 1[1] -> 4[4] via P2P/CUMEM/read
    2024-12-04T05:27:48.125Z
    algo-1:18058:20779 [1] NCCL INFO Channel 01/1 : 1[1] -> 4[4] via P2P/CUMEM/read
    2024-12-04T05:27:48.125Z
    algo-1:18057:18152 [0] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:48.125Z
    algo-1:18063:20783 [6] NCCL INFO Channel 04/1 : 6[6] -> 9[1] [send] via NET/AWS Libfabric/0(0)/GDRDMA/Shared
    2024-12-04T05:27:48.125Z
    algo-1:18057:18152 [0] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:48.125Z
    algo-1:18062:20785 [5] NCCL INFO Channel 04/1 : 5[5] -> 9[1] [send] via NET/AWS Libfabric/0(0)/GDRDMA/Shared
    2024-12-04T05:27:48.125Z
    algo-1:18057:18152 [0] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:48.125Z
    algo-1:18063:20783 [6] NCCL INFO Channel 05/1 : 6[6] -> 9[1] [send] via NET/AWS Libfabric/0(0)/GDRDMA/Shared
    2024-12-04T05:27:48.125Z
    algo-1:18057:18152 [0] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:48.125Z
    algo-1:18061:20784 [4] NCCL INFO Channel 04/1 : 4[4] -> 8[0] [send] via NET/AWS Libfabric/0(0)/GDRDMA/Shared
    2024-12-04T05:27:48.125Z
    algo-1:18057:18152 [0] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:48.125Z
    algo-1:18062:20785 [5] NCCL INFO Channel 05/1 : 5[5] -> 9[1] [send] via NET/AWS Libfabric/0(0)/GDRDMA/Shared
    2024-12-04T05:27:48.125Z
    algo-1:18057:18152 [0] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:48.125Z
    algo-1:18061:20784 [4] NCCL INFO Channel 05/1 : 4[4] -> 8[0] [send] via NET/AWS Libfabric/0(0)/GDRDMA/Shared
    2024-12-04T05:27:48.125Z
    algo-1:18058:18145 [1] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:48.125Z
    algo-1:18058:20779 [1] NCCL INFO Channel 04/1 : 13[5] -> 1[1] [receive] via NET/AWS Libfabric/0/GDRDMA/Shared
    2024-12-04T05:27:48.125Z
    algo-1:18058:18145 [1] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:48.125Z
    algo-1:18058:20779 [1] NCCL INFO Channel 05/1 : 13[5] -> 1[1] [receive] via NET/AWS Libfabric/0/GDRDMA/Shared
    2024-12-04T05:27:48.125Z
    algo-1:18058:20779 [1] NCCL INFO Channel 00/1 : 1[1] -> 5[5] via P2P/CUMEM/read
    2024-12-04T05:27:48.125Z
    algo-1:18057:18152 [0] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:48.125Z
    algo-1:18057:20781 [0] NCCL INFO Channel 04/1 : 13[5] -> 0[0] [receive] via NET/AWS Libfabric/0/GDRDMA/Shared
    2024-12-04T05:27:48.125Z
    algo-1:18057:18152 [0] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:48.125Z
    algo-1:18057:20781 [0] NCCL INFO Channel 05/1 : 13[5] -> 0[0] [receive] via NET/AWS Libfabric/0/GDRDMA/Shared
    2024-12-04T05:27:48.125Z
    algo-1:18057:20781 [0] NCCL INFO Channel 00/1 : 0[0] -> 3[3] via P2P/CUMEM/read
    2024-12-04T05:27:48.125Z
    algo-1:18058:20779 [1] NCCL INFO Channel 01/1 : 1[1] -> 5[5] via P2P/CUMEM/read
    2024-12-04T05:27:48.125Z
    algo-1:18057:20781 [0] NCCL INFO Channel 01/1 : 0[0] -> 3[3] via P2P/CUMEM/read
    2024-12-04T05:27:48.125Z
    algo-1:18057:18152 [0] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:48.125Z
    algo-1:18057:20781 [0] NCCL INFO Channel 04/1 : 12[4] -> 0[0] [receive] via NET/AWS Libfabric/0/GDRDMA/Shared
    2024-12-04T05:27:48.125Z
    algo-1:18057:18152 [0] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:48.125Z
    algo-1:18057:20781 [0] NCCL INFO Channel 05/1 : 12[4] -> 0[0] [receive] via NET/AWS Libfabric/0/GDRDMA/Shared
    2024-12-04T05:27:48.125Z
    algo-1:18057:20781 [0] NCCL INFO Channel 00/1 : 0[0] -> 4[4] via P2P/CUMEM/read
    2024-12-04T05:27:48.125Z
    algo-1:18057:20781 [0] NCCL INFO Channel 01/1 : 0[0] -> 4[4] via P2P/CUMEM/read
    2024-12-04T05:27:48.125Z
    algo-1:18061:18141 [4] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:48.125Z
    algo-1:18061:20784 [4] NCCL INFO Channel 04/1 : 15[7] -> 4[4] [receive] via NET/AWS Libfabric/2/GDRDMA/Shared
    2024-12-04T05:27:48.126Z
    algo-1:18061:18141 [4] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:48.126Z
    algo-1:18061:20784 [4] NCCL INFO Channel 05/1 : 15[7] -> 4[4] [receive] via NET/AWS Libfabric/2/GDRDMA/Shared
    2024-12-04T05:27:48.126Z
    algo-1:18057:18152 [0] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:48.126Z
    algo-1:18061:20784 [4] NCCL INFO Channel 04/1 : 4[4] -> 9[1] [send] via NET/AWS Libfabric/0(0)/GDRDMA/Shared
    2024-12-04T05:27:48.126Z
    algo-1:18057:18152 [0] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:48.126Z
    algo-1:18061:20784 [4] NCCL INFO Channel 05/1 : 4[4] -> 9[1] [send] via NET/AWS Libfabric/0(0)/GDRDMA/Shared
    2024-12-04T05:27:48.126Z
    algo-1:18059:18146 [2] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:48.126Z
    algo-1:18059:20782 [2] NCCL INFO Channel 04/1 : 14[6] -> 2[2] [receive] via NET/AWS Libfabric/1/GDRDMA/Shared
    2024-12-04T05:27:48.126Z
    algo-1:18059:18146 [2] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:48.126Z
    algo-1:18063:20783 [6] NCCL INFO Channel 04/1 : 6[6] -> 10[2] [send] via NET/AWS Libfabric/1(2)/GDRDMA/Shared
    2024-12-04T05:27:48.126Z
    algo-1:18059:18146 [2] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:48.126Z
    algo-1:18059:20782 [2] NCCL INFO Channel 05/1 : 14[6] -> 2[2] [receive] via NET/AWS Libfabric/1/GDRDMA/Shared
    2024-12-04T05:27:48.126Z
    algo-1:18059:20782 [2] NCCL INFO Channel 00/1 : 2[2] -> 6[6] via P2P/CUMEM/read
    2024-12-04T05:27:48.126Z
    algo-1:18059:18146 [2] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:48.126Z
    algo-1:18063:20783 [6] NCCL INFO Channel 05/1 : 6[6] -> 10[2] [send] via NET/AWS Libfabric/1(2)/GDRDMA/Shared
    2024-12-04T05:27:48.126Z
    algo-1:18060:18148 [3] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:48.126Z
    algo-1:18060:20786 [3] NCCL INFO Channel 04/1 : 15[7] -> 3[3] [receive] via NET/AWS Libfabric/1/GDRDMA/Shared
    2024-12-04T05:27:48.126Z
    algo-1:18060:18148 [3] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:48.126Z
    algo-1:18060:20786 [3] NCCL INFO Channel 05/1 : 15[7] -> 3[3] [receive] via NET/AWS Libfabric/1/GDRDMA/Shared
    2024-12-04T05:27:48.126Z
    algo-1:18060:20786 [3] NCCL INFO Channel 00/1 : 3[3] -> 7[7] via P2P/CUMEM/read
    2024-12-04T05:27:48.126Z
    algo-1:18059:20782 [2] NCCL INFO Channel 01/1 : 2[2] -> 6[6] via P2P/CUMEM/read
    2024-12-04T05:27:48.126Z
    algo-1:18060:20786 [3] NCCL INFO Channel 01/1 : 3[3] -> 7[7] via P2P/CUMEM/read
    2024-12-04T05:27:48.126Z
    algo-1:18059:18146 [2] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:48.126Z
    algo-1:18059:20782 [2] NCCL INFO Channel 04/1 : 13[5] -> 2[2] [receive] via NET/AWS Libfabric/1/GDRDMA/Shared
    2024-12-04T05:27:48.126Z
    algo-1:18059:18146 [2] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:48.126Z
    algo-1:18059:20782 [2] NCCL INFO Channel 05/1 : 13[5] -> 2[2] [receive] via NET/AWS Libfabric/1/GDRDMA/Shared
    2024-12-04T05:27:48.126Z
    algo-1:18059:20782 [2] NCCL INFO Channel 00/1 : 2[2] -> 7[7] via P2P/CUMEM/read
    2024-12-04T05:27:48.126Z
    algo-1:18059:20782 [2] NCCL INFO Channel 01/1 : 2[2] -> 7[7] via P2P/CUMEM/read
    2024-12-04T05:27:48.126Z
    algo-1:18057:18152 [0] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:48.126Z
    algo-1:18057:20781 [0] NCCL INFO Channel 04/1 : 11[3] -> 0[0] [receive] via NET/AWS Libfabric/0/GDRDMA/Shared
    2024-12-04T05:27:48.126Z
    algo-1:18057:18152 [0] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:48.126Z
    algo-1:18057:20781 [0] NCCL INFO Channel 05/1 : 11[3] -> 0[0] [receive] via NET/AWS Libfabric/0/GDRDMA/Shared
    2024-12-04T05:27:48.126Z
    algo-1:18057:20781 [0] NCCL INFO Channel 00/1 : 0[0] -> 5[5] via P2P/CUMEM/read
    2024-12-04T05:27:48.126Z
    algo-1:18057:20781 [0] NCCL INFO Channel 01/1 : 0[0] -> 5[5] via P2P/CUMEM/read
    2024-12-04T05:27:48.126Z
    algo-1:18059:18146 [2] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:48.126Z
    algo-1:18063:20783 [6] NCCL INFO Channel 04/1 : 6[6] -> 11[3] [send] via NET/AWS Libfabric/1(2)/GDRDMA/Shared
    2024-12-04T05:27:48.126Z
    algo-1:18059:18146 [2] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:48.126Z
    algo-1:18063:20783 [6] NCCL INFO Channel 05/1 : 6[6] -> 11[3] [send] via NET/AWS Libfabric/1(2)/GDRDMA/Shared
    2024-12-04T05:27:48.126Z
    algo-1:18059:18146 [2] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:48.126Z
    algo-1:18064:20780 [7] NCCL INFO Channel 04/1 : 7[7] -> 11[3] [send] via NET/AWS Libfabric/1(2)/GDRDMA/Shared
    2024-12-04T05:27:48.126Z
    algo-1:18059:18146 [2] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:48.126Z
    algo-1:18064:20780 [7] NCCL INFO Channel 05/1 : 7[7] -> 11[3] [send] via NET/AWS Libfabric/1(2)/GDRDMA/Shared
    2024-12-04T05:27:48.126Z
    algo-1:18060:18148 [3] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:48.126Z
    algo-1:18060:20786 [3] NCCL INFO Channel 04/1 : 14[6] -> 3[3] [receive] via NET/AWS Libfabric/1/GDRDMA/Shared
    2024-12-04T05:27:48.126Z
    algo-1:18060:18148 [3] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:48.126Z
    algo-1:18060:20786 [3] NCCL INFO Channel 05/1 : 14[6] -> 3[3] [receive] via NET/AWS Libfabric/1/GDRDMA/Shared
    2024-12-04T05:27:48.127Z
    algo-1:18057:18152 [0] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:48.127Z
    algo-1:18060:20786 [3] NCCL INFO Channel 04/1 : 3[3] -> 8[0] [send] via NET/AWS Libfabric/0(0)/GDRDMA/Shared
    2024-12-04T05:27:48.127Z
    algo-1:18057:18152 [0] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:48.127Z
    algo-1:18060:20786 [3] NCCL INFO Channel 05/1 : 3[3] -> 8[0] [send] via NET/AWS Libfabric/0(0)/GDRDMA/Shared
    2024-12-04T05:27:48.127Z
    algo-1:18059:18146 [2] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:48.127Z
    algo-1:18062:20785 [5] NCCL INFO Channel 04/1 : 5[5] -> 10[2] [send] via NET/AWS Libfabric/1(2)/GDRDMA/Shared
    2024-12-04T05:27:48.127Z
    algo-1:18059:18146 [2] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:48.127Z
    algo-1:18062:20785 [5] NCCL INFO Channel 05/1 : 5[5] -> 10[2] [send] via NET/AWS Libfabric/1(2)/GDRDMA/Shared
    2024-12-04T05:27:48.127Z
    algo-1:18061:18141 [4] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:48.127Z
    algo-1:18064:20780 [7] NCCL INFO Channel 04/1 : 7[7] -> 12[4] [send] via NET/AWS Libfabric/2(4)/GDRDMA/Shared
    2024-12-04T05:27:48.127Z
    algo-1:18061:18141 [4] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:48.127Z
    algo-1:18064:20780 [7] NCCL INFO Channel 05/1 : 7[7] -> 12[4] [send] via NET/AWS Libfabric/2(4)/GDRDMA/Shared
    2024-12-04T05:27:48.127Z
    algo-1:18062:18140 [5] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:48.127Z
    algo-1:18061:18141 [4] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:48.127Z
    algo-1:18064:20780 [7] NCCL INFO Channel 04/1 : 7[7] -> 13[5] [send] via NET/AWS Libfabric/2(4)/GDRDMA/Shared
    2024-12-04T05:27:48.127Z
    algo-1:18061:18141 [4] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:48.127Z
    algo-1:18064:20780 [7] NCCL INFO Channel 05/1 : 7[7] -> 13[5] [send] via NET/AWS Libfabric/2(4)/GDRDMA/Shared
    2024-12-04T05:27:48.127Z
    algo-1:18058:18145 [1] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:48.127Z
    algo-1:18058:20779 [1] NCCL INFO Channel 04/1 : 12[4] -> 1[1] [receive] via NET/AWS Libfabric/0/GDRDMA/Shared
    2024-12-04T05:27:48.127Z
    algo-1:18058:18145 [1] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:48.127Z
    algo-1:18058:20779 [1] NCCL INFO Channel 05/1 : 12[4] -> 1[1] [receive] via NET/AWS Libfabric/0/GDRDMA/Shared
    2024-12-04T05:27:48.127Z
    algo-1:18058:20779 [1] NCCL INFO Channel 00/1 : 1[1] -> 6[6] via P2P/CUMEM/read
    2024-12-04T05:27:48.127Z
    algo-1:18058:20779 [1] NCCL INFO Channel 01/1 : 1[1] -> 6[6] via P2P/CUMEM/read
    2024-12-04T05:27:48.127Z
    algo-1:18058:18145 [1] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:48.127Z
    algo-1:18058:20779 [1] NCCL INFO Channel 04/1 : 11[3] -> 1[1] [receive] via NET/AWS Libfabric/0/GDRDMA/Shared
    2024-12-04T05:27:48.127Z
    algo-1:18058:18145 [1] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:48.127Z
    algo-1:18058:20779 [1] NCCL INFO Channel 05/1 : 11[3] -> 1[1] [receive] via NET/AWS Libfabric/0/GDRDMA/Shared
    2024-12-04T05:27:48.127Z
    algo-1:18058:20779 [1] NCCL INFO Channel 00/1 : 1[1] -> 7[7] via P2P/CUMEM/read
    2024-12-04T05:27:48.127Z
    algo-1:18058:20779 [1] NCCL INFO Channel 01/1 : 1[1] -> 7[7] via P2P/CUMEM/read
    2024-12-04T05:27:48.127Z
    algo-1:18060:18148 [3] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:48.127Z
    algo-1:18060:20786 [3] NCCL INFO Channel 04/1 : 13[5] -> 3[3] [receive] via NET/AWS Libfabric/1/GDRDMA/Shared
    2024-12-04T05:27:48.127Z
    algo-1:18060:18148 [3] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:48.127Z
    algo-1:18060:20786 [3] NCCL INFO Channel 05/1 : 13[5] -> 3[3] [receive] via NET/AWS Libfabric/1/GDRDMA/Shared
    2024-12-04T05:27:48.127Z
    algo-1:18062:20785 [5] NCCL INFO Channel 04/1 : 15[7] -> 5[5] [receive] via NET/AWS Libfabric/2/GDRDMA/Shared
    2024-12-04T05:27:48.127Z
    algo-1:18062:18140 [5] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:48.127Z
    algo-1:18062:20785 [5] NCCL INFO Channel 05/1 : 15[7] -> 5[5] [receive] via NET/AWS Libfabric/2/GDRDMA/Shared
    2024-12-04T05:27:48.127Z
    algo-1:18058:18145 [1] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:48.127Z
    algo-1:18058:20779 [1] NCCL INFO Channel 04/1 : 10[2] -> 1[1] [receive] via NET/AWS Libfabric/0/GDRDMA/Shared
    2024-12-04T05:27:48.127Z
    algo-1:18058:18145 [1] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:48.127Z
    algo-1:18058:20779 [1] NCCL INFO Channel 05/1 : 10[2] -> 1[1] [receive] via NET/AWS Libfabric/0/GDRDMA/Shared
    2024-12-04T05:27:48.127Z
    algo-1:18061:18141 [4] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:48.127Z
    algo-1:18061:20784 [4] NCCL INFO Channel 04/1 : 14[6] -> 4[4] [receive] via NET/AWS Libfabric/2/GDRDMA/Shared
    2024-12-04T05:27:48.127Z
    algo-1:18061:18141 [4] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:48.127Z
    algo-1:18061:20784 [4] NCCL INFO Channel 05/1 : 14[6] -> 4[4] [receive] via NET/AWS Libfabric/2/GDRDMA/Shared
    2024-12-04T05:27:48.127Z
    algo-1:18057:18152 [0] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:48.128Z
    algo-1:18057:20781 [0] NCCL INFO Channel 04/1 : 10[2] -> 0[0] [receive] via NET/AWS Libfabric/0/GDRDMA/Shared
    2024-12-04T05:27:48.128Z
    algo-1:18057:18152 [0] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:48.128Z
    algo-1:18060:20786 [3] NCCL INFO Channel 04/1 : 3[3] -> 9[1] [send] via NET/AWS Libfabric/0(0)/GDRDMA/Shared
    2024-12-04T05:27:48.128Z
    algo-1:18057:18152 [0] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:48.128Z
    algo-1:18057:20781 [0] NCCL INFO Channel 05/1 : 10[2] -> 0[0] [receive] via NET/AWS Libfabric/0/GDRDMA/Shared
    2024-12-04T05:27:48.128Z
    algo-1:18057:20781 [0] NCCL INFO Channel 00/1 : 0[0] -> 6[6] via P2P/CUMEM/read
    2024-12-04T05:27:48.128Z
    algo-1:18057:18152 [0] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:48.128Z
    algo-1:18058:20779 [1] NCCL INFO Channel 04/1 : 1[1] -> 8[0] [send] via NET/AWS Libfabric/0(0)/GDRDMA/Shared
    2024-12-04T05:27:48.128Z
    algo-1:18057:20781 [0] NCCL INFO Channel 01/1 : 0[0] -> 6[6] via P2P/CUMEM/read
    2024-12-04T05:27:48.128Z
    algo-1:18057:18152 [0] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:48.128Z
    algo-1:18060:20786 [3] NCCL INFO Channel 05/1 : 3[3] -> 9[1] [send] via NET/AWS Libfabric/0(0)/GDRDMA/Shared
    2024-12-04T05:27:48.128Z
    algo-1:18057:18152 [0] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:48.128Z
    algo-1:18058:20779 [1] NCCL INFO Channel 05/1 : 1[1] -> 8[0] [send] via NET/AWS Libfabric/0(0)/GDRDMA/Shared
    2024-12-04T05:27:48.128Z
    algo-1:18060:18148 [3] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:48.128Z
    algo-1:18060:20786 [3] NCCL INFO Channel 04/1 : 12[4] -> 3[3] [receive] via NET/AWS Libfabric/1/GDRDMA/Shared
    2024-12-04T05:27:48.128Z
    algo-1:18060:18148 [3] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:48.128Z
    algo-1:18060:20786 [3] NCCL INFO Channel 05/1 : 12[4] -> 3[3] [receive] via NET/AWS Libfabric/1/GDRDMA/Shared
    2024-12-04T05:27:48.128Z
    algo-1:18059:18146 [2] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:48.128Z
    algo-1:18059:18146 [2] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:48.128Z
    algo-1:18059:20782 [2] NCCL INFO Channel 04/1 : 12[4] -> 2[2] [receive] via NET/AWS Libfabric/1/GDRDMA/Shared
    2024-12-04T05:27:48.128Z
    algo-1:18059:18146 [2] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:48.128Z
    algo-1:18060:20786 [3] NCCL INFO Channel 04/1 : 3[3] -> 10[2] [send] via NET/AWS Libfabric/1(2)/GDRDMA/Shared
    2024-12-04T05:27:48.128Z
    algo-1:18059:18146 [2] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:48.128Z
    algo-1:18059:20782 [2] NCCL INFO Channel 05/1 : 12[4] -> 2[2] [receive] via NET/AWS Libfabric/1/GDRDMA/Shared
    2024-12-04T05:27:48.128Z
    algo-1:18059:18146 [2] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:48.128Z
    algo-1:18060:20786 [3] NCCL INFO Channel 05/1 : 3[3] -> 10[2] [send] via NET/AWS Libfabric/1(2)/GDRDMA/Shared
    2024-12-04T05:27:48.128Z
    algo-1:18057:18152 [0] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:48.128Z
    algo-1:18059:20782 [2] NCCL INFO Channel 04/1 : 2[2] -> 8[0] [send] via NET/AWS Libfabric/0(0)/GDRDMA/Shared
    2024-12-04T05:27:48.128Z
    algo-1:18062:20785 [5] NCCL INFO Channel 04/1 : 5[5] -> 11[3] [send] via NET/AWS Libfabric/1(2)/GDRDMA/Shared
    2024-12-04T05:27:48.128Z
    algo-1:18057:18152 [0] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:48.128Z
    algo-1:18059:20782 [2] NCCL INFO Channel 05/1 : 2[2] -> 8[0] [send] via NET/AWS Libfabric/0(0)/GDRDMA/Shared
    2024-12-04T05:27:48.128Z
    algo-1:18059:18146 [2] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:48.128Z
    algo-1:18062:20785 [5] NCCL INFO Channel 05/1 : 5[5] -> 11[3] [send] via NET/AWS Libfabric/1(2)/GDRDMA/Shared
    2024-12-04T05:27:48.128Z
    algo-1:18059:18146 [2] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:48.128Z
    algo-1:18061:20784 [4] NCCL INFO Channel 04/1 : 4[4] -> 10[2] [send] via NET/AWS Libfabric/1(2)/GDRDMA/Shared
    2024-12-04T05:27:48.128Z
    algo-1:18059:18146 [2] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:48.128Z
    algo-1:18061:20784 [4] NCCL INFO Channel 05/1 : 4[4] -> 10[2] [send] via NET/AWS Libfabric/1(2)/GDRDMA/Shared
    2024-12-04T05:27:48.128Z
    algo-1:18062:18140 [5] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:48.128Z
    algo-1:18062:20785 [5] NCCL INFO Channel 04/1 : 14[6] -> 5[5] [receive] via NET/AWS Libfabric/2/GDRDMA/Shared
    2024-12-04T05:27:48.128Z
    algo-1:18062:18140 [5] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:48.128Z
    algo-1:18062:20785 [5] NCCL INFO Channel 05/1 : 14[6] -> 5[5] [receive] via NET/AWS Libfabric/2/GDRDMA/Shared
    2024-12-04T05:27:48.128Z
    algo-1:18059:18146 [2] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:48.128Z
    algo-1:18059:20782 [2] NCCL INFO Channel 04/1 : 11[3] -> 2[2] [receive] via NET/AWS Libfabric/1/GDRDMA/Shared
    2024-12-04T05:27:48.128Z
    algo-1:18059:18146 [2] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:48.128Z
    algo-1:18059:20782 [2] NCCL INFO Channel 05/1 : 11[3] -> 2[2] [receive] via NET/AWS Libfabric/1/GDRDMA/Shared
    2024-12-04T05:27:48.128Z
    algo-1:18057:18152 [0] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:48.128Z
    algo-1:18059:20782 [2] NCCL INFO Channel 04/1 : 2[2] -> 9[1] [send] via NET/AWS Libfabric/0(0)/GDRDMA/Shared
    2024-12-04T05:27:48.128Z
    algo-1:18057:18152 [0] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:48.128Z
    algo-1:18059:20782 [2] NCCL INFO Channel 05/1 : 2[2] -> 9[1] [send] via NET/AWS Libfabric/0(0)/GDRDMA/Shared
    2024-12-04T05:27:48.129Z
    algo-1:18059:18146 [2] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:48.129Z
    algo-1:18059:20782 [2] NCCL INFO Channel 04/1 : 10[2] -> 2[2] [receive] via NET/AWS Libfabric/1/GDRDMA/Shared
    2024-12-04T05:27:48.129Z
    algo-1:18059:18146 [2] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:48.129Z
    algo-1:18059:20782 [2] NCCL INFO Channel 05/1 : 10[2] -> 2[2] [receive] via NET/AWS Libfabric/1/GDRDMA/Shared
    2024-12-04T05:27:48.129Z
    algo-1:18059:18146 [2] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:48.129Z
    algo-1:18059:20782 [2] NCCL INFO Channel 04/1 : 2[2] -> 10[2] [send] via NET/AWS Libfabric/1/GDRDMA/Shared
    2024-12-04T05:27:48.129Z
    algo-1:18059:18146 [2] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:48.129Z
    algo-1:18059:20782 [2] NCCL INFO Channel 05/1 : 2[2] -> 10[2] [send] via NET/AWS Libfabric/1/GDRDMA/Shared
    2024-12-04T05:27:48.129Z
    algo-1:18059:18146 [2] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:48.129Z
    algo-1:18059:20782 [2] NCCL INFO Channel 04/1 : 9[1] -> 2[2] [receive] via NET/AWS Libfabric/1/GDRDMA/Shared
    2024-12-04T05:27:48.129Z
    algo-1:18059:18146 [2] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:48.129Z
    algo-1:18059:20782 [2] NCCL INFO Channel 05/1 : 9[1] -> 2[2] [receive] via NET/AWS Libfabric/1/GDRDMA/Shared
    2024-12-04T05:27:48.129Z
    algo-1:18058:18145 [1] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:48.129Z
    algo-1:18059:18146 [2] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:48.129Z
    algo-1:18058:20779 [1] NCCL INFO Channel 04/1 : 9[1] -> 1[1] [receive] via NET/AWS Libfabric/0/GDRDMA/Shared
    2024-12-04T05:27:48.129Z
    algo-1:18059:20782 [2] NCCL INFO Channel 04/1 : 2[2] -> 11[3] [send] via NET/AWS Libfabric/1/GDRDMA/Shared
    2024-12-04T05:27:48.129Z
    algo-1:18058:18145 [1] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:48.129Z
    algo-1:18059:18146 [2] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:48.129Z
    algo-1:18058:20779 [1] NCCL INFO Channel 05/1 : 9[1] -> 1[1] [receive] via NET/AWS Libfabric/0/GDRDMA/Shared
    2024-12-04T05:27:48.129Z
    algo-1:18059:20782 [2] NCCL INFO Channel 05/1 : 2[2] -> 11[3] [send] via NET/AWS Libfabric/1/GDRDMA/Shared
    2024-12-04T05:27:48.129Z
    algo-1:18057:18152 [0] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:48.129Z
    algo-1:18058:20779 [1] NCCL INFO Channel 04/1 : 1[1] -> 9[1] [send] via NET/AWS Libfabric/0(0)/GDRDMA/Shared
    2024-12-04T05:27:48.129Z
    algo-1:18057:18152 [0] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:48.129Z
    algo-1:18058:20779 [1] NCCL INFO Channel 05/1 : 1[1] -> 9[1] [send] via NET/AWS Libfabric/0(0)/GDRDMA/Shared
    2024-12-04T05:27:48.129Z
    algo-1:18063:18138 [6] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:48.129Z
    algo-1:18064:20780 [7] NCCL INFO Channel 04/1 : 7[7] -> 14[6] [send] via NET/AWS Libfabric/3(6)/GDRDMA/Shared
    2024-12-04T05:27:48.129Z
    algo-1:18063:18138 [6] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:48.129Z
    algo-1:18064:20780 [7] NCCL INFO Channel 05/1 : 7[7] -> 14[6] [send] via NET/AWS Libfabric/3(6)/GDRDMA/Shared
    2024-12-04T05:27:49.130Z
    algo-1:18061:18141 [4] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.130Z
    algo-1:18061:20784 [4] NCCL INFO Channel 04/1 : 13[5] -> 4[4] [receive] via NET/AWS Libfabric/2/GDRDMA/Shared
    2024-12-04T05:27:49.130Z
    algo-1:18061:18141 [4] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.130Z
    algo-1:18063:20783 [6] NCCL INFO Channel 04/1 : 6[6] -> 12[4] [send] via NET/AWS Libfabric/2(4)/GDRDMA/Shared
    2024-12-04T05:27:49.130Z
    algo-1:18061:18141 [4] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.130Z
    algo-1:18061:20784 [4] NCCL INFO Channel 05/1 : 13[5] -> 4[4] [receive] via NET/AWS Libfabric/2/GDRDMA/Shared
    2024-12-04T05:27:49.130Z
    algo-1:18061:18141 [4] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.130Z
    algo-1:18062:20785 [5] NCCL INFO Channel 04/1 : 5[5] -> 12[4] [send] via NET/AWS Libfabric/2(4)/GDRDMA/Shared
    2024-12-04T05:27:49.130Z
    algo-1:18061:18141 [4] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.130Z
    algo-1:18063:20783 [6] NCCL INFO Channel 05/1 : 6[6] -> 12[4] [send] via NET/AWS Libfabric/2(4)/GDRDMA/Shared
    2024-12-04T05:27:49.130Z
    algo-1:18061:18141 [4] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.130Z
    algo-1:18062:20785 [5] NCCL INFO Channel 05/1 : 5[5] -> 12[4] [send] via NET/AWS Libfabric/2(4)/GDRDMA/Shared
    2024-12-04T05:27:49.130Z
    algo-1:18063:18138 [6] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.130Z
    algo-1:18063:20783 [6] NCCL INFO Channel 04/1 : 15[7] -> 6[6] [receive] via NET/AWS Libfabric/3/GDRDMA/Shared
    2024-12-04T05:27:49.130Z
    algo-1:18063:18138 [6] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.130Z
    algo-1:18063:20783 [6] NCCL INFO Channel 05/1 : 15[7] -> 6[6] [receive] via NET/AWS Libfabric/3/GDRDMA/Shared
    2024-12-04T05:27:49.130Z
    algo-1:18061:18141 [4] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.130Z
    algo-1:18063:20783 [6] NCCL INFO Channel 04/1 : 6[6] -> 13[5] [send] via NET/AWS Libfabric/2(4)/GDRDMA/Shared
    2024-12-04T05:27:49.130Z
    algo-1:18061:18141 [4] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.130Z
    algo-1:18063:20783 [6] NCCL INFO Channel 05/1 : 6[6] -> 13[5] [send] via NET/AWS Libfabric/2(4)/GDRDMA/Shared
    2024-12-04T05:27:49.130Z
    algo-1:18063:18138 [6] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.130Z
    algo-1:18063:20783 [6] NCCL INFO Channel 04/1 : 14[6] -> 6[6] [receive] via NET/AWS Libfabric/3/GDRDMA/Shared
    2024-12-04T05:27:49.130Z
    algo-1:18063:18138 [6] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.130Z
    algo-1:18063:20783 [6] NCCL INFO Channel 05/1 : 14[6] -> 6[6] [receive] via NET/AWS Libfabric/3/GDRDMA/Shared
    2024-12-04T05:27:49.130Z
    algo-1:18063:18138 [6] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.130Z
    algo-1:18063:20783 [6] NCCL INFO Channel 04/1 : 6[6] -> 14[6] [send] via NET/AWS Libfabric/3/GDRDMA/Shared
    2024-12-04T05:27:49.130Z
    algo-1:18063:18138 [6] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.130Z
    algo-1:18063:20783 [6] NCCL INFO Channel 05/1 : 6[6] -> 14[6] [send] via NET/AWS Libfabric/3/GDRDMA/Shared
    2024-12-04T05:27:49.130Z
    algo-1:18059:18146 [2] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.130Z
    algo-1:18061:20784 [4] NCCL INFO Channel 04/1 : 4[4] -> 11[3] [send] via NET/AWS Libfabric/1(2)/GDRDMA/Shared
    2024-12-04T05:27:49.130Z
    algo-1:18057:18152 [0] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.130Z
    algo-1:18059:18146 [2] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.130Z
    algo-1:18061:20784 [4] NCCL INFO Channel 05/1 : 4[4] -> 11[3] [send] via NET/AWS Libfabric/1(2)/GDRDMA/Shared
    2024-12-04T05:27:49.130Z
    algo-1:18057:20781 [0] NCCL INFO Channel 04/1 : 9[1] -> 0[0] [receive] via NET/AWS Libfabric/0/GDRDMA/Shared
    2024-12-04T05:27:49.130Z
    algo-1:18057:18152 [0] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.130Z
    algo-1:18057:20781 [0] NCCL INFO Channel 05/1 : 9[1] -> 0[0] [receive] via NET/AWS Libfabric/0/GDRDMA/Shared
    2024-12-04T05:27:49.130Z
    algo-1:18057:20781 [0] NCCL INFO Channel 00/1 : 0[0] -> 7[7] via P2P/CUMEM/read
    2024-12-04T05:27:49.130Z
    algo-1:18057:20781 [0] NCCL INFO Channel 01/1 : 0[0] -> 7[7] via P2P/CUMEM/read
    2024-12-04T05:27:49.130Z
    algo-1:18057:18152 [0] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.130Z
    algo-1:18057:20781 [0] NCCL INFO Channel 04/1 : 8[0] -> 0[0] [receive] via NET/AWS Libfabric/0/GDRDMA/Shared
    2024-12-04T05:27:49.130Z
    algo-1:18057:18152 [0] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.130Z
    algo-1:18057:20781 [0] NCCL INFO Channel 05/1 : 8[0] -> 0[0] [receive] via NET/AWS Libfabric/0/GDRDMA/Shared
    2024-12-04T05:27:49.131Z
    algo-1:18057:18152 [0] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.131Z
    algo-1:18057:20781 [0] NCCL INFO Channel 04/1 : 0[0] -> 8[0] [send] via NET/AWS Libfabric/0/GDRDMA/Shared
    2024-12-04T05:27:49.131Z
    algo-1:18057:18152 [0] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.131Z
    algo-1:18057:20781 [0] NCCL INFO Channel 05/1 : 0[0] -> 8[0] [send] via NET/AWS Libfabric/0/GDRDMA/Shared
    2024-12-04T05:27:49.131Z
    algo-1:18061:18141 [4] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.131Z
    algo-1:18061:20784 [4] NCCL INFO Channel 04/1 : 12[4] -> 4[4] [receive] via NET/AWS Libfabric/2/GDRDMA/Shared
    2024-12-04T05:27:49.131Z
    algo-1:18061:18141 [4] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.131Z
    algo-1:18061:20784 [4] NCCL INFO Channel 05/1 : 12[4] -> 4[4] [receive] via NET/AWS Libfabric/2/GDRDMA/Shared
    2024-12-04T05:27:49.131Z
    algo-1:18061:18141 [4] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.131Z
    algo-1:18061:20784 [4] NCCL INFO Channel 04/1 : 4[4] -> 12[4] [send] via NET/AWS Libfabric/2/GDRDMA/Shared
    2024-12-04T05:27:49.131Z
    algo-1:18061:18141 [4] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.131Z
    algo-1:18061:20784 [4] NCCL INFO Channel 05/1 : 4[4] -> 12[4] [send] via NET/AWS Libfabric/2/GDRDMA/Shared
    2024-12-04T05:27:49.131Z
    algo-1:18064:18139 [7] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.131Z
    algo-1:18057:18152 [0] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.131Z
    algo-1:18057:20781 [0] NCCL INFO Channel 04/1 : 0[0] -> 9[1] [send] via NET/AWS Libfabric/0/GDRDMA/Shared
    2024-12-04T05:27:49.131Z
    algo-1:18057:18152 [0] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.131Z
    algo-1:18057:20781 [0] NCCL INFO Channel 05/1 : 0[0] -> 9[1] [send] via NET/AWS Libfabric/0/GDRDMA/Shared
    2024-12-04T05:27:49.131Z
    algo-1:18064:20780 [7] NCCL INFO Channel 04/1 : 15[7] -> 7[7] [receive] via NET/AWS Libfabric/3/GDRDMA/Shared
    2024-12-04T05:27:49.131Z
    algo-1:18064:18139 [7] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.131Z
    algo-1:18064:20780 [7] NCCL INFO Channel 05/1 : 15[7] -> 7[7] [receive] via NET/AWS Libfabric/3/GDRDMA/Shared
    2024-12-04T05:27:49.131Z
    algo-1:18063:18138 [6] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.131Z
    algo-1:18064:20780 [7] NCCL INFO Channel 04/1 : 7[7] -> 15[7] [send] via NET/AWS Libfabric/3(6)/GDRDMA/Shared
    2024-12-04T05:27:49.131Z
    algo-1:18063:18138 [6] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.131Z
    algo-1:18063:18138 [6] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.131Z
    algo-1:18064:20780 [7] NCCL INFO Channel 05/1 : 7[7] -> 15[7] [send] via NET/AWS Libfabric/3(6)/GDRDMA/Shared
    2024-12-04T05:27:49.131Z
    algo-1:18064:18139 [7] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.131Z
    algo-1:18064:20780 [7] NCCL INFO Channel 04/1 : 14[6] -> 7[7] [receive] via NET/AWS Libfabric/3/GDRDMA/Shared
    2024-12-04T05:27:49.131Z
    algo-1:18064:18139 [7] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.131Z
    algo-1:18064:20780 [7] NCCL INFO Channel 05/1 : 14[6] -> 7[7] [receive] via NET/AWS Libfabric/3/GDRDMA/Shared
    2024-12-04T05:27:49.131Z
    algo-1:18064:20780 [7] NCCL INFO Channel 00/1 : 7[7] -> 0[0] via P2P/CUMEM/read
    2024-12-04T05:27:49.131Z
    algo-1:18064:20780 [7] NCCL INFO Channel 01/1 : 7[7] -> 0[0] via P2P/CUMEM/read
    2024-12-04T05:27:49.131Z
    algo-1:18064:18139 [7] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.131Z
    algo-1:18064:20780 [7] NCCL INFO Channel 04/1 : 13[5] -> 7[7] [receive] via NET/AWS Libfabric/3/GDRDMA/Shared
    2024-12-04T05:27:49.131Z
    algo-1:18064:18139 [7] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.131Z
    algo-1:18064:20780 [7] NCCL INFO Channel 05/1 : 13[5] -> 7[7] [receive] via NET/AWS Libfabric/3/GDRDMA/Shared
    2024-12-04T05:27:49.131Z
    algo-1:18063:20783 [6] NCCL INFO Channel 04/1 : 13[5] -> 6[6] [receive] via NET/AWS Libfabric/3/GDRDMA/Shared
    2024-12-04T05:27:49.131Z
    algo-1:18064:20780 [7] NCCL INFO Channel 00/1 : 7[7] -> 1[1] via P2P/CUMEM/read
    2024-12-04T05:27:49.131Z
    algo-1:18063:18138 [6] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.131Z
    algo-1:18063:20783 [6] NCCL INFO Channel 05/1 : 13[5] -> 6[6] [receive] via NET/AWS Libfabric/3/GDRDMA/Shared
    2024-12-04T05:27:49.131Z
    algo-1:18063:18138 [6] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.131Z
    algo-1:18064:20780 [7] NCCL INFO Channel 01/1 : 7[7] -> 1[1] via P2P/CUMEM/read
    2024-12-04T05:27:49.131Z
    algo-1:18063:20783 [6] NCCL INFO Channel 04/1 : 6[6] -> 15[7] [send] via NET/AWS Libfabric/3/GDRDMA/Shared
    2024-12-04T05:27:49.131Z
    algo-1:18063:18138 [6] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.131Z
    algo-1:18063:20783 [6] NCCL INFO Channel 05/1 : 6[6] -> 15[7] [send] via NET/AWS Libfabric/3/GDRDMA/Shared
    2024-12-04T05:27:49.131Z
    algo-1:18060:18148 [3] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.131Z
    algo-1:18060:20786 [3] NCCL INFO Channel 04/1 : 11[3] -> 3[3] [receive] via NET/AWS Libfabric/1/GDRDMA/Shared
    2024-12-04T05:27:49.131Z
    algo-1:18060:18148 [3] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.131Z
    algo-1:18060:20786 [3] NCCL INFO Channel 05/1 : 11[3] -> 3[3] [receive] via NET/AWS Libfabric/1/GDRDMA/Shared
    2024-12-04T05:27:49.131Z
    algo-1:18059:18146 [2] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.131Z
    algo-1:18060:20786 [3] NCCL INFO Channel 04/1 : 3[3] -> 11[3] [send] via NET/AWS Libfabric/1(2)/GDRDMA/Shared
    2024-12-04T05:27:49.131Z
    algo-1:18059:18146 [2] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.132Z
    algo-1:18060:20786 [3] NCCL INFO Channel 05/1 : 3[3] -> 11[3] [send] via NET/AWS Libfabric/1(2)/GDRDMA/Shared
    2024-12-04T05:27:49.132Z
    algo-1:18060:18148 [3] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.132Z
    algo-1:18060:20786 [3] NCCL INFO Channel 04/1 : 10[2] -> 3[3] [receive] via NET/AWS Libfabric/1/GDRDMA/Shared
    2024-12-04T05:27:49.132Z
    algo-1:18060:18148 [3] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.132Z
    algo-1:18060:20786 [3] NCCL INFO Channel 05/1 : 10[2] -> 3[3] [receive] via NET/AWS Libfabric/1/GDRDMA/Shared
    2024-12-04T05:27:49.132Z
    algo-1:18061:18141 [4] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.132Z
    algo-1:18060:20786 [3] NCCL INFO Channel 04/1 : 3[3] -> 12[4] [send] via NET/AWS Libfabric/2(4)/GDRDMA/Shared
    2024-12-04T05:27:49.132Z
    algo-1:18061:18141 [4] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.132Z
    algo-1:18060:20786 [3] NCCL INFO Channel 05/1 : 3[3] -> 12[4] [send] via NET/AWS Libfabric/2(4)/GDRDMA/Shared
    2024-12-04T05:27:49.132Z
    algo-1:18058:18145 [1] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.132Z
    algo-1:18058:20779 [1] NCCL INFO Channel 04/1 : 8[0] -> 1[1] [receive] via NET/AWS Libfabric/0/GDRDMA/Shared
    2024-12-04T05:27:49.132Z
    algo-1:18060:18148 [3] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.132Z
    algo-1:18060:20786 [3] NCCL INFO Channel 04/1 : 9[1] -> 3[3] [receive] via NET/AWS Libfabric/1/GDRDMA/Shared
    2024-12-04T05:27:49.132Z
    algo-1:18058:18145 [1] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.132Z
    algo-1:18058:20779 [1] NCCL INFO Channel 05/1 : 8[0] -> 1[1] [receive] via NET/AWS Libfabric/0/GDRDMA/Shared
    2024-12-04T05:27:49.132Z
    algo-1:18060:18148 [3] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.132Z
    algo-1:18060:20786 [3] NCCL INFO Channel 05/1 : 9[1] -> 3[3] [receive] via NET/AWS Libfabric/1/GDRDMA/Shared
    2024-12-04T05:27:49.132Z
    algo-1:18061:18141 [4] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.132Z
    algo-1:18060:20786 [3] NCCL INFO Channel 04/1 : 3[3] -> 13[5] [send] via NET/AWS Libfabric/2(4)/GDRDMA/Shared
    2024-12-04T05:27:49.132Z
    algo-1:18061:18141 [4] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.132Z
    algo-1:18060:20786 [3] NCCL INFO Channel 05/1 : 3[3] -> 13[5] [send] via NET/AWS Libfabric/2(4)/GDRDMA/Shared
    2024-12-04T05:27:49.132Z
    algo-1:18061:18141 [4] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.132Z
    algo-1:18061:20784 [4] NCCL INFO Channel 04/1 : 11[3] -> 4[4] [receive] via NET/AWS Libfabric/2/GDRDMA/Shared
    2024-12-04T05:27:49.132Z
    algo-1:18061:18141 [4] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.132Z
    algo-1:18061:20784 [4] NCCL INFO Channel 05/1 : 11[3] -> 4[4] [receive] via NET/AWS Libfabric/2/GDRDMA/Shared
    2024-12-04T05:27:49.132Z
    algo-1:18061:18141 [4] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.132Z
    algo-1:18061:20784 [4] NCCL INFO Channel 04/1 : 4[4] -> 13[5] [send] via NET/AWS Libfabric/2/GDRDMA/Shared
    2024-12-04T05:27:49.132Z
    algo-1:18061:18141 [4] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.132Z
    algo-1:18061:20784 [4] NCCL INFO Channel 05/1 : 4[4] -> 13[5] [send] via NET/AWS Libfabric/2/GDRDMA/Shared
    2024-12-04T05:27:49.132Z
    algo-1:18059:18146 [2] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.132Z
    algo-1:18059:20782 [2] NCCL INFO Channel 04/1 : 8[0] -> 2[2] [receive] via NET/AWS Libfabric/1/GDRDMA/Shared
    2024-12-04T05:27:49.132Z
    algo-1:18059:18146 [2] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.132Z
    algo-1:18057:20781 [0] NCCL INFO Channel 04/1 : 0[0] -> 10[2] [send] via NET/AWS Libfabric/1(2)/GDRDMA/Shared
    2024-12-04T05:27:49.132Z
    algo-1:18059:18146 [2] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.132Z
    algo-1:18059:20782 [2] NCCL INFO Channel 05/1 : 8[0] -> 2[2] [receive] via NET/AWS Libfabric/1/GDRDMA/Shared
    2024-12-04T05:27:49.132Z
    algo-1:18059:18146 [2] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.132Z
    algo-1:18058:20779 [1] NCCL INFO Channel 04/1 : 1[1] -> 10[2] [send] via NET/AWS Libfabric/1(2)/GDRDMA/Shared
    2024-12-04T05:27:49.132Z
    algo-1:18059:18146 [2] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.132Z
    algo-1:18062:18140 [5] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.132Z
    algo-1:18057:20781 [0] NCCL INFO Channel 05/1 : 0[0] -> 10[2] [send] via NET/AWS Libfabric/1(2)/GDRDMA/Shared
    2024-12-04T05:27:49.132Z
    algo-1:18059:18146 [2] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.132Z
    algo-1:18061:18141 [4] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.132Z
    algo-1:18058:20779 [1] NCCL INFO Channel 05/1 : 1[1] -> 10[2] [send] via NET/AWS Libfabric/1(2)/GDRDMA/Shared
    2024-12-04T05:27:49.132Z
    algo-1:18059:20782 [2] NCCL INFO Channel 04/1 : 2[2] -> 12[4] [send] via NET/AWS Libfabric/2(4)/GDRDMA/Shared
    2024-12-04T05:27:49.132Z
    algo-1:18062:20785 [5] NCCL INFO Channel 04/1 : 13[5] -> 5[5] [receive] via NET/AWS Libfabric/2/GDRDMA/Shared
    2024-12-04T05:27:49.133Z
    algo-1:18062:18140 [5] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.133Z
    algo-1:18061:18141 [4] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.133Z
    algo-1:18062:20785 [5] NCCL INFO Channel 05/1 : 13[5] -> 5[5] [receive] via NET/AWS Libfabric/2/GDRDMA/Shared
    2024-12-04T05:27:49.133Z
    algo-1:18059:20782 [2] NCCL INFO Channel 05/1 : 2[2] -> 12[4] [send] via NET/AWS Libfabric/2(4)/GDRDMA/Shared
    2024-12-04T05:27:49.133Z
    algo-1:18061:18141 [4] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.133Z
    algo-1:18062:20785 [5] NCCL INFO Channel 04/1 : 5[5] -> 13[5] [send] via NET/AWS Libfabric/2(4)/GDRDMA/Shared
    2024-12-04T05:27:49.133Z
    algo-1:18061:18141 [4] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.133Z
    algo-1:18062:20785 [5] NCCL INFO Channel 05/1 : 5[5] -> 13[5] [send] via NET/AWS Libfabric/2(4)/GDRDMA/Shared
    2024-12-04T05:27:49.133Z
    algo-1:18062:18140 [5] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.133Z
    algo-1:18062:20785 [5] NCCL INFO Channel 04/1 : 12[4] -> 5[5] [receive] via NET/AWS Libfabric/2/GDRDMA/Shared
    2024-12-04T05:27:49.133Z
    algo-1:18062:18140 [5] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.133Z
    algo-1:18062:20785 [5] NCCL INFO Channel 05/1 : 12[4] -> 5[5] [receive] via NET/AWS Libfabric/2/GDRDMA/Shared
    2024-12-04T05:27:49.133Z
    algo-1:18061:18141 [4] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.133Z
    algo-1:18061:20784 [4] NCCL INFO Channel 04/1 : 10[2] -> 4[4] [receive] via NET/AWS Libfabric/2/GDRDMA/Shared
    2024-12-04T05:27:49.133Z
    algo-1:18061:18141 [4] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.133Z
    algo-1:18061:20784 [4] NCCL INFO Channel 05/1 : 10[2] -> 4[4] [receive] via NET/AWS Libfabric/2/GDRDMA/Shared
    2024-12-04T05:27:49.133Z
    algo-1:18060:18148 [3] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.133Z
    algo-1:18060:20786 [3] NCCL INFO Channel 04/1 : 8[0] -> 3[3] [receive] via NET/AWS Libfabric/1/GDRDMA/Shared
    2024-12-04T05:27:49.133Z
    algo-1:18060:18148 [3] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.133Z
    algo-1:18060:20786 [3] NCCL INFO Channel 05/1 : 8[0] -> 3[3] [receive] via NET/AWS Libfabric/1/GDRDMA/Shared
    2024-12-04T05:27:49.133Z
    algo-1:18063:18138 [6] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.133Z
    algo-1:18062:20785 [5] NCCL INFO Channel 04/1 : 5[5] -> 14[6] [send] via NET/AWS Libfabric/3(6)/GDRDMA/Shared
    2024-12-04T05:27:49.133Z
    algo-1:18063:18138 [6] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.133Z
    algo-1:18063:20783 [6] NCCL INFO Channel 04/1 : 12[4] -> 6[6] [receive] via NET/AWS Libfabric/3/GDRDMA/Shared
    2024-12-04T05:27:49.133Z
    algo-1:18063:18138 [6] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.133Z
    algo-1:18060:20786 [3] NCCL INFO Channel 04/1 : 3[3] -> 14[6] [send] via NET/AWS Libfabric/3(6)/GDRDMA/Shared
    2024-12-04T05:27:49.133Z
    algo-1:18063:18138 [6] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.133Z
    algo-1:18061:20784 [4] NCCL INFO Channel 04/1 : 4[4] -> 14[6] [send] via NET/AWS Libfabric/3(6)/GDRDMA/Shared
    2024-12-04T05:27:49.133Z
    algo-1:18063:18138 [6] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.133Z
    algo-1:18063:20783 [6] NCCL INFO Channel 05/1 : 12[4] -> 6[6] [receive] via NET/AWS Libfabric/3/GDRDMA/Shared
    2024-12-04T05:27:49.133Z
    algo-1:18063:18138 [6] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.133Z
    algo-1:18063:20783 [6] NCCL INFO Channel 00/1 : 6[6] -> 0[0] via P2P/CUMEM/read
    2024-12-04T05:27:49.133Z
    algo-1:18060:20786 [3] NCCL INFO Channel 05/1 : 3[3] -> 14[6] [send] via NET/AWS Libfabric/3(6)/GDRDMA/Shared
    2024-12-04T05:27:49.133Z
    algo-1:18063:18138 [6] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.133Z
    algo-1:18062:20785 [5] NCCL INFO Channel 05/1 : 5[5] -> 14[6] [send] via NET/AWS Libfabric/3(6)/GDRDMA/Shared
    2024-12-04T05:27:49.133Z
    algo-1:18062:18140 [5] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.133Z
    algo-1:18062:20785 [5] NCCL INFO Channel 04/1 : 11[3] -> 5[5] [receive] via NET/AWS Libfabric/2/GDRDMA/Shared
    2024-12-04T05:27:49.133Z
    algo-1:18062:18140 [5] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.133Z
    algo-1:18062:20785 [5] NCCL INFO Channel 05/1 : 11[3] -> 5[5] [receive] via NET/AWS Libfabric/2/GDRDMA/Shared
    2024-12-04T05:27:49.133Z
    algo-1:18063:20783 [6] NCCL INFO Channel 01/1 : 6[6] -> 0[0] via P2P/CUMEM/read
    2024-12-04T05:27:49.133Z
    algo-1:18063:18138 [6] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.134Z
    algo-1:18061:20784 [4] NCCL INFO Channel 05/1 : 4[4] -> 14[6] [send] via NET/AWS Libfabric/3(6)/GDRDMA/Shared
    2024-12-04T05:27:49.134Z
    algo-1:18063:18138 [6] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.134Z
    algo-1:18062:20785 [5] NCCL INFO Channel 04/1 : 5[5] -> 15[7] [send] via NET/AWS Libfabric/3(6)/GDRDMA/Shared
    2024-12-04T05:27:49.134Z
    algo-1:18063:18138 [6] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.134Z
    algo-1:18062:20785 [5] NCCL INFO Channel 05/1 : 5[5] -> 15[7] [send] via NET/AWS Libfabric/3(6)/GDRDMA/Shared
    2024-12-04T05:27:49.134Z
    algo-1:18062:18140 [5] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.134Z
    algo-1:18062:20785 [5] NCCL INFO Channel 04/1 : 10[2] -> 5[5] [receive] via NET/AWS Libfabric/2/GDRDMA/Shared
    2024-12-04T05:27:49.134Z
    algo-1:18062:18140 [5] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.134Z
    algo-1:18062:20785 [5] NCCL INFO Channel 05/1 : 10[2] -> 5[5] [receive] via NET/AWS Libfabric/2/GDRDMA/Shared
    2024-12-04T05:27:49.134Z
    algo-1:18062:20785 [5] NCCL INFO Channel 00/1 : 5[5] -> 0[0] via P2P/CUMEM/read
    2024-12-04T05:27:49.134Z
    algo-1:18062:20785 [5] NCCL INFO Channel 01/1 : 5[5] -> 0[0] via P2P/CUMEM/read
    2024-12-04T05:27:49.134Z
    algo-1:18059:18146 [2] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.134Z
    algo-1:18058:20779 [1] NCCL INFO Channel 04/1 : 1[1] -> 11[3] [send] via NET/AWS Libfabric/1(2)/GDRDMA/Shared
    2024-12-04T05:27:49.134Z
    algo-1:18059:18146 [2] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.134Z
    algo-1:18058:20779 [1] NCCL INFO Channel 05/1 : 1[1] -> 11[3] [send] via NET/AWS Libfabric/1(2)/GDRDMA/Shared
    2024-12-04T05:27:49.134Z
    algo-1:18064:18139 [7] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.134Z
    algo-1:18064:20780 [7] NCCL INFO Channel 04/1 : 12[4] -> 7[7] [receive] via NET/AWS Libfabric/3/GDRDMA/Shared
    2024-12-04T05:27:49.134Z
    algo-1:18064:18139 [7] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.134Z
    algo-1:18064:20780 [7] NCCL INFO Channel 05/1 : 12[4] -> 7[7] [receive] via NET/AWS Libfabric/3/GDRDMA/Shared
    2024-12-04T05:27:49.134Z
    algo-1:18064:20780 [7] NCCL INFO Channel 00/1 : 7[7] -> 2[2] via P2P/CUMEM/read
    2024-12-04T05:27:49.134Z
    algo-1:18064:20780 [7] NCCL INFO Channel 01/1 : 7[7] -> 2[2] via P2P/CUMEM/read
    2024-12-04T05:27:49.134Z
    algo-1:18061:18141 [4] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.134Z
    algo-1:18058:20779 [1] NCCL INFO Channel 04/1 : 1[1] -> 12[4] [send] via NET/AWS Libfabric/2(4)/GDRDMA/Shared
    2024-12-04T05:27:49.134Z
    algo-1:18061:18141 [4] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.134Z
    algo-1:18058:20779 [1] NCCL INFO Channel 05/1 : 1[1] -> 12[4] [send] via NET/AWS Libfabric/2(4)/GDRDMA/Shared
    2024-12-04T05:27:49.134Z
    algo-1:18059:18146 [2] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.134Z
    algo-1:18057:20781 [0] NCCL INFO Channel 04/1 : 0[0] -> 11[3] [send] via NET/AWS Libfabric/1(2)/GDRDMA/Shared
    2024-12-04T05:27:49.134Z
    algo-1:18059:18146 [2] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.134Z
    algo-1:18057:20781 [0] NCCL INFO Channel 05/1 : 0[0] -> 11[3] [send] via NET/AWS Libfabric/1(2)/GDRDMA/Shared
    2024-12-04T05:27:49.134Z
    algo-1:18061:18141 [4] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.134Z
    algo-1:18061:20784 [4] NCCL INFO Channel 04/1 : 9[1] -> 4[4] [receive] via NET/AWS Libfabric/2/GDRDMA/Shared
    2024-12-04T05:27:49.134Z
    algo-1:18061:18141 [4] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.134Z
    algo-1:18061:20784 [4] NCCL INFO Channel 05/1 : 9[1] -> 4[4] [receive] via NET/AWS Libfabric/2/GDRDMA/Shared
    2024-12-04T05:27:49.134Z
    algo-1:18061:18141 [4] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.134Z
    algo-1:18059:20782 [2] NCCL INFO Channel 04/1 : 2[2] -> 13[5] [send] via NET/AWS Libfabric/2(4)/GDRDMA/Shared
    2024-12-04T05:27:49.134Z
    algo-1:18061:18141 [4] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.134Z
    algo-1:18059:20782 [2] NCCL INFO Channel 05/1 : 2[2] -> 13[5] [send] via NET/AWS Libfabric/2(4)/GDRDMA/Shared
    2024-12-04T05:27:49.134Z
    algo-1:18063:18138 [6] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.134Z
    algo-1:18063:20783 [6] NCCL INFO Channel 04/1 : 11[3] -> 6[6] [receive] via NET/AWS Libfabric/3/GDRDMA/Shared
    2024-12-04T05:27:49.134Z
    algo-1:18063:18138 [6] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.134Z
    algo-1:18059:20782 [2] NCCL INFO Channel 04/1 : 2[2] -> 14[6] [send] via NET/AWS Libfabric/3(6)/GDRDMA/Shared
    2024-12-04T05:27:49.134Z
    algo-1:18063:18138 [6] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.134Z
    algo-1:18063:20783 [6] NCCL INFO Channel 05/1 : 11[3] -> 6[6] [receive] via NET/AWS Libfabric/3/GDRDMA/Shared
    2024-12-04T05:27:49.134Z
    algo-1:18063:18138 [6] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.135Z
    algo-1:18063:20783 [6] NCCL INFO Channel 00/1 : 6[6] -> 1[1] via P2P/CUMEM/read
    2024-12-04T05:27:49.135Z
    algo-1:18061:20784 [4] NCCL INFO Channel 04/1 : 4[4] -> 15[7] [send] via NET/AWS Libfabric/3(6)/GDRDMA/Shared
    2024-12-04T05:27:49.135Z
    algo-1:18063:20783 [6] NCCL INFO Channel 01/1 : 6[6] -> 1[1] via P2P/CUMEM/read
    2024-12-04T05:27:49.135Z
    algo-1:18063:18138 [6] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.135Z
    algo-1:18059:20782 [2] NCCL INFO Channel 05/1 : 2[2] -> 14[6] [send] via NET/AWS Libfabric/3(6)/GDRDMA/Shared
    2024-12-04T05:27:49.135Z
    algo-1:18063:18138 [6] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.135Z
    algo-1:18061:20784 [4] NCCL INFO Channel 05/1 : 4[4] -> 15[7] [send] via NET/AWS Libfabric/3(6)/GDRDMA/Shared
    2024-12-04T05:27:49.135Z
    algo-1:18061:18141 [4] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.135Z
    algo-1:18061:20784 [4] NCCL INFO Channel 04/1 : 8[0] -> 4[4] [receive] via NET/AWS Libfabric/2/GDRDMA/Shared
    2024-12-04T05:27:49.135Z
    algo-1:18061:18141 [4] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.135Z
    algo-1:18058:20779 [1] NCCL INFO Channel 04/1 : 1[1] -> 13[5] [send] via NET/AWS Libfabric/2(4)/GDRDMA/Shared
    2024-12-04T05:27:49.135Z
    algo-1:18061:18141 [4] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.135Z
    algo-1:18061:20784 [4] NCCL INFO Channel 05/1 : 8[0] -> 4[4] [receive] via NET/AWS Libfabric/2/GDRDMA/Shared
    2024-12-04T05:27:49.135Z
    algo-1:18061:20784 [4] NCCL INFO Channel 00/1 : 4[4] -> 0[0] via P2P/CUMEM/read
    2024-12-04T05:27:49.135Z
    algo-1:18061:18141 [4] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.135Z
    algo-1:18058:20779 [1] NCCL INFO Channel 05/1 : 1[1] -> 13[5] [send] via NET/AWS Libfabric/2(4)/GDRDMA/Shared
    2024-12-04T05:27:49.135Z
    algo-1:18061:20784 [4] NCCL INFO Channel 01/1 : 4[4] -> 0[0] via P2P/CUMEM/read
    2024-12-04T05:27:49.135Z
    algo-1:18064:18139 [7] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.135Z
    algo-1:18064:20780 [7] NCCL INFO Channel 04/1 : 11[3] -> 7[7] [receive] via NET/AWS Libfabric/3/GDRDMA/Shared
    2024-12-04T05:27:49.135Z
    algo-1:18064:18139 [7] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.135Z
    algo-1:18064:20780 [7] NCCL INFO Channel 05/1 : 11[3] -> 7[7] [receive] via NET/AWS Libfabric/3/GDRDMA/Shared
    2024-12-04T05:27:49.135Z
    algo-1:18064:20780 [7] NCCL INFO Channel 00/1 : 7[7] -> 3[3] via P2P/CUMEM/read
    2024-12-04T05:27:49.135Z
    algo-1:18064:20780 [7] NCCL INFO Channel 01/1 : 7[7] -> 3[3] via P2P/CUMEM/read
    2024-12-04T05:27:49.135Z
    algo-1:18063:18138 [6] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.135Z
    algo-1:18063:20783 [6] NCCL INFO Channel 04/1 : 10[2] -> 6[6] [receive] via NET/AWS Libfabric/3/GDRDMA/Shared
    2024-12-04T05:27:49.135Z
    algo-1:18063:18138 [6] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.135Z
    algo-1:18063:20783 [6] NCCL INFO Channel 05/1 : 10[2] -> 6[6] [receive] via NET/AWS Libfabric/3/GDRDMA/Shared
    2024-12-04T05:27:49.135Z
    algo-1:18063:20783 [6] NCCL INFO Channel 00/1 : 6[6] -> 2[2] via P2P/CUMEM/read
    2024-12-04T05:27:49.135Z
    algo-1:18063:20783 [6] NCCL INFO Channel 01/1 : 6[6] -> 2[2] via P2P/CUMEM/read
    2024-12-04T05:27:49.135Z
    algo-1:18062:18140 [5] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.135Z
    algo-1:18062:20785 [5] NCCL INFO Channel 04/1 : 9[1] -> 5[5] [receive] via NET/AWS Libfabric/2/GDRDMA/Shared
    2024-12-04T05:27:49.135Z
    algo-1:18062:18140 [5] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.135Z
    algo-1:18062:20785 [5] NCCL INFO Channel 05/1 : 9[1] -> 5[5] [receive] via NET/AWS Libfabric/2/GDRDMA/Shared
    2024-12-04T05:27:49.135Z
    algo-1:18062:20785 [5] NCCL INFO Channel 00/1 : 5[5] -> 1[1] via P2P/CUMEM/read
    2024-12-04T05:27:49.135Z
    algo-1:18062:20785 [5] NCCL INFO Channel 01/1 : 5[5] -> 1[1] via P2P/CUMEM/read
    2024-12-04T05:27:49.135Z
    algo-1:18063:18138 [6] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.135Z
    algo-1:18063:18138 [6] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.135Z
    algo-1:18063:20783 [6] NCCL INFO Channel 04/1 : 9[1] -> 6[6] [receive] via NET/AWS Libfabric/3/GDRDMA/Shared
    2024-12-04T05:27:49.135Z
    algo-1:18063:18138 [6] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.136Z
    algo-1:18059:20782 [2] NCCL INFO Channel 04/1 : 2[2] -> 15[7] [send] via NET/AWS Libfabric/3(6)/GDRDMA/Shared
    2024-12-04T05:27:49.136Z
    algo-1:18063:18138 [6] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.136Z
    algo-1:18063:20783 [6] NCCL INFO Channel 05/1 : 9[1] -> 6[6] [receive] via NET/AWS Libfabric/3/GDRDMA/Shared
    2024-12-04T05:27:49.136Z
    algo-1:18063:20783 [6] NCCL INFO Channel 00/1 : 6[6] -> 3[3] via P2P/CUMEM/read
    2024-12-04T05:27:49.136Z
    algo-1:18063:18138 [6] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.136Z
    algo-1:18059:20782 [2] NCCL INFO Channel 05/1 : 2[2] -> 15[7] [send] via NET/AWS Libfabric/3(6)/GDRDMA/Shared
    2024-12-04T05:27:49.136Z
    algo-1:18063:20783 [6] NCCL INFO Channel 01/1 : 6[6] -> 3[3] via P2P/CUMEM/read
    2024-12-04T05:27:49.136Z
    algo-1:18060:20786 [3] NCCL INFO Channel 04/1 : 3[3] -> 15[7] [send] via NET/AWS Libfabric/3(6)/GDRDMA/Shared
    2024-12-04T05:27:49.136Z
    algo-1:18063:18138 [6] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.136Z
    algo-1:18060:20786 [3] NCCL INFO Channel 05/1 : 3[3] -> 15[7] [send] via NET/AWS Libfabric/3(6)/GDRDMA/Shared
    2024-12-04T05:27:49.136Z
    algo-1:18060:20786 [3] NCCL INFO Channel 00/1 : 3[3] -> 0[0] via P2P/CUMEM/read
    2024-12-04T05:27:49.136Z
    algo-1:18060:20786 [3] NCCL INFO Channel 01/1 : 3[3] -> 0[0] via P2P/CUMEM/read
    2024-12-04T05:27:49.136Z
    algo-1:18061:18141 [4] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.136Z
    algo-1:18057:20781 [0] NCCL INFO Channel 04/1 : 0[0] -> 12[4] [send] via NET/AWS Libfabric/2(4)/GDRDMA/Shared
    2024-12-04T05:27:49.136Z
    algo-1:18061:18141 [4] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.136Z
    algo-1:18057:20781 [0] NCCL INFO Channel 05/1 : 0[0] -> 12[4] [send] via NET/AWS Libfabric/2(4)/GDRDMA/Shared
    2024-12-04T05:27:49.136Z
    algo-1:18061:20784 [4] NCCL INFO Channel 00/1 : 4[4] -> 1[1] via P2P/CUMEM/read
    2024-12-04T05:27:49.136Z
    algo-1:18061:20784 [4] NCCL INFO Channel 01/1 : 4[4] -> 1[1] via P2P/CUMEM/read
    2024-12-04T05:27:49.136Z
    algo-1:18063:18138 [6] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.136Z
    algo-1:18058:20779 [1] NCCL INFO Channel 04/1 : 1[1] -> 14[6] [send] via NET/AWS Libfabric/3(6)/GDRDMA/Shared
    2024-12-04T05:27:49.136Z
    algo-1:18063:18138 [6] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.136Z
    algo-1:18058:20779 [1] NCCL INFO Channel 05/1 : 1[1] -> 14[6] [send] via NET/AWS Libfabric/3(6)/GDRDMA/Shared
    2024-12-04T05:27:49.136Z
    algo-1:18062:18140 [5] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.136Z
    algo-1:18062:20785 [5] NCCL INFO Channel 04/1 : 8[0] -> 5[5] [receive] via NET/AWS Libfabric/2/GDRDMA/Shared
    2024-12-04T05:27:49.136Z
    algo-1:18062:18140 [5] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.136Z
    algo-1:18062:20785 [5] NCCL INFO Channel 05/1 : 8[0] -> 5[5] [receive] via NET/AWS Libfabric/2/GDRDMA/Shared
    2024-12-04T05:27:49.136Z
    algo-1:18062:20785 [5] NCCL INFO Channel 00/1 : 5[5] -> 2[2] via P2P/CUMEM/read
    2024-12-04T05:27:49.136Z
    algo-1:18062:20785 [5] NCCL INFO Channel 01/1 : 5[5] -> 2[2] via P2P/CUMEM/read
    2024-12-04T05:27:49.136Z
    algo-1:18063:18138 [6] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.136Z
    algo-1:18058:20779 [1] NCCL INFO Channel 04/1 : 1[1] -> 15[7] [send] via NET/AWS Libfabric/3(6)/GDRDMA/Shared
    2024-12-04T05:27:49.136Z
    algo-1:18063:18138 [6] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.136Z
    algo-1:18058:20779 [1] NCCL INFO Channel 05/1 : 1[1] -> 15[7] [send] via NET/AWS Libfabric/3(6)/GDRDMA/Shared
    2024-12-04T05:27:49.136Z
    algo-1:18059:20782 [2] NCCL INFO Channel 00/1 : 2[2] -> 0[0] via P2P/CUMEM/read
    2024-12-04T05:27:49.136Z
    algo-1:18062:20785 [5] NCCL INFO Channel 00/1 : 5[5] -> 3[3] via P2P/CUMEM/read
    2024-12-04T05:27:49.137Z
    algo-1:18059:20782 [2] NCCL INFO Channel 01/1 : 2[2] -> 0[0] via P2P/CUMEM/read
    2024-12-04T05:27:49.137Z
    algo-1:18062:20785 [5] NCCL INFO Channel 01/1 : 5[5] -> 3[3] via P2P/CUMEM/read
    2024-12-04T05:27:49.137Z
    algo-1:18064:18139 [7] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.137Z
    algo-1:18064:20780 [7] NCCL INFO Channel 04/1 : 10[2] -> 7[7] [receive] via NET/AWS Libfabric/3/GDRDMA/Shared
    2024-12-04T05:27:49.137Z
    algo-1:18064:18139 [7] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.137Z
    algo-1:18064:20780 [7] NCCL INFO Channel 05/1 : 10[2] -> 7[7] [receive] via NET/AWS Libfabric/3/GDRDMA/Shared
    2024-12-04T05:27:49.137Z
    algo-1:18064:20780 [7] NCCL INFO Channel 00/1 : 7[7] -> 4[4] via P2P/CUMEM/read
    2024-12-04T05:27:49.137Z
    algo-1:18061:18141 [4] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.137Z
    algo-1:18057:20781 [0] NCCL INFO Channel 04/1 : 0[0] -> 13[5] [send] via NET/AWS Libfabric/2(4)/GDRDMA/Shared
    2024-12-04T05:27:49.137Z
    algo-1:18061:18141 [4] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.137Z
    algo-1:18057:20781 [0] NCCL INFO Channel 05/1 : 0[0] -> 13[5] [send] via NET/AWS Libfabric/2(4)/GDRDMA/Shared
    2024-12-04T05:27:49.137Z
    algo-1:18064:20780 [7] NCCL INFO Channel 01/1 : 7[7] -> 4[4] via P2P/CUMEM/read
    2024-12-04T05:27:49.137Z
    algo-1:18064:18139 [7] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.137Z
    algo-1:18064:20780 [7] NCCL INFO Channel 04/1 : 9[1] -> 7[7] [receive] via NET/AWS Libfabric/3/GDRDMA/Shared
    2024-12-04T05:27:49.137Z
    algo-1:18064:18139 [7] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.137Z
    algo-1:18064:20780 [7] NCCL INFO Channel 05/1 : 9[1] -> 7[7] [receive] via NET/AWS Libfabric/3/GDRDMA/Shared
    2024-12-04T05:27:49.137Z
    algo-1:18064:20780 [7] NCCL INFO Channel 00/1 : 7[7] -> 5[5] via P2P/CUMEM/read
    2024-12-04T05:27:49.137Z
    algo-1:18060:20786 [3] NCCL INFO Channel 00/1 : 3[3] -> 1[1] via P2P/CUMEM/read
    2024-12-04T05:27:49.137Z
    algo-1:18064:20780 [7] NCCL INFO Channel 01/1 : 7[7] -> 5[5] via P2P/CUMEM/read
    2024-12-04T05:27:49.137Z
    algo-1:18060:20786 [3] NCCL INFO Channel 01/1 : 3[3] -> 1[1] via P2P/CUMEM/read
    2024-12-04T05:27:49.137Z
    algo-1:18064:18139 [7] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.137Z
    algo-1:18064:20780 [7] NCCL INFO Channel 04/1 : 8[0] -> 7[7] [receive] via NET/AWS Libfabric/3/GDRDMA/Shared
    2024-12-04T05:27:49.137Z
    algo-1:18064:18139 [7] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.137Z
    algo-1:18064:20780 [7] NCCL INFO Channel 05/1 : 8[0] -> 7[7] [receive] via NET/AWS Libfabric/3/GDRDMA/Shared
    2024-12-04T05:27:49.137Z
    algo-1:18064:20780 [7] NCCL INFO Channel 00/1 : 7[7] -> 6[6] via P2P/CUMEM/read
    2024-12-04T05:27:49.137Z
    algo-1:18064:20780 [7] NCCL INFO Channel 01/1 : 7[7] -> 6[6] via P2P/CUMEM/read
    2024-12-04T05:27:49.137Z
    algo-1:18060:20786 [3] NCCL INFO Channel 00/1 : 3[3] -> 2[2] via P2P/CUMEM/read
    2024-12-04T05:27:49.137Z
    algo-1:18058:20779 [1] NCCL INFO Channel 00/1 : 1[1] -> 0[0] via P2P/CUMEM/read
    2024-12-04T05:27:49.137Z
    algo-1:18060:20786 [3] NCCL INFO Channel 01/1 : 3[3] -> 2[2] via P2P/CUMEM/read
    2024-12-04T05:27:49.137Z
    algo-1:18058:20779 [1] NCCL INFO Channel 01/1 : 1[1] -> 0[0] via P2P/CUMEM/read
    2024-12-04T05:27:49.137Z
    algo-1:18062:20785 [5] NCCL INFO Channel 00/1 : 5[5] -> 4[4] via P2P/CUMEM/read
    2024-12-04T05:27:49.137Z
    algo-1:18063:18138 [6] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.137Z
    algo-1:18063:20783 [6] NCCL INFO Channel 04/1 : 8[0] -> 6[6] [receive] via NET/AWS Libfabric/3/GDRDMA/Shared
    2024-12-04T05:27:49.137Z
    algo-1:18062:20785 [5] NCCL INFO Channel 01/1 : 5[5] -> 4[4] via P2P/CUMEM/read
    2024-12-04T05:27:49.137Z
    algo-1:18063:18138 [6] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.137Z
    algo-1:18063:20783 [6] NCCL INFO Channel 05/1 : 8[0] -> 6[6] [receive] via NET/AWS Libfabric/3/GDRDMA/Shared
    2024-12-04T05:27:49.137Z
    algo-1:18063:20783 [6] NCCL INFO Channel 00/1 : 6[6] -> 4[4] via P2P/CUMEM/read
    2024-12-04T05:27:49.137Z
    algo-1:18061:20784 [4] NCCL INFO Channel 00/1 : 4[4] -> 2[2] via P2P/CUMEM/read
    2024-12-04T05:27:49.137Z
    algo-1:18061:20784 [4] NCCL INFO Channel 01/1 : 4[4] -> 2[2] via P2P/CUMEM/read
    2024-12-04T05:27:49.137Z
    algo-1:18063:20783 [6] NCCL INFO Channel 01/1 : 6[6] -> 4[4] via P2P/CUMEM/read
    2024-12-04T05:27:49.137Z
    algo-1:18063:18138 [6] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.137Z
    algo-1:18057:20781 [0] NCCL INFO Channel 04/1 : 0[0] -> 14[6] [send] via NET/AWS Libfabric/3(6)/GDRDMA/Shared
    2024-12-04T05:27:49.137Z
    algo-1:18063:18138 [6] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.137Z
    algo-1:18057:20781 [0] NCCL INFO Channel 05/1 : 0[0] -> 14[6] [send] via NET/AWS Libfabric/3(6)/GDRDMA/Shared
    2024-12-04T05:27:49.137Z
    algo-1:18063:18138 [6] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.137Z
    algo-1:18057:20781 [0] NCCL INFO Channel 04/1 : 0[0] -> 15[7] [send] via NET/AWS Libfabric/3(6)/GDRDMA/Shared
    2024-12-04T05:27:49.137Z
    algo-1:18063:18138 [6] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T05:27:49.137Z
    algo-1:18057:20781 [0] NCCL INFO Channel 05/1 : 0[0] -> 15[7] [send] via NET/AWS Libfabric/3(6)/GDRDMA/Shared
    2024-12-04T05:27:49.137Z
    algo-1:18061:20784 [4] NCCL INFO Channel 00/1 : 4[4] -> 3[3] via P2P/CUMEM/read
    2024-12-04T05:27:49.138Z
    algo-1:18061:20784 [4] NCCL INFO Channel 01/1 : 4[4] -> 3[3] via P2P/CUMEM/read
    2024-12-04T05:27:49.138Z
    algo-1:18059:20782 [2] NCCL INFO Channel 00/1 : 2[2] -> 1[1] via P2P/CUMEM/read
    2024-12-04T05:27:49.138Z
    algo-1:18059:20782 [2] NCCL INFO Channel 01/1 : 2[2] -> 1[1] via P2P/CUMEM/read
    2024-12-04T05:27:49.138Z
    algo-1:18063:20783 [6] NCCL INFO Channel 00/1 : 6[6] -> 5[5] via P2P/CUMEM/read
    2024-12-04T05:27:49.138Z
    algo-1:18063:20783 [6] NCCL INFO Channel 01/1 : 6[6] -> 5[5] via P2P/CUMEM/read
    2024-12-04T05:27:51.138Z
    /opt/conda/lib/python3.11/site-packages/torchrec/distributed/comm_ops.py:2157: FutureWarning: torch.distributed._reduce_scatter_base is a private function and will be deprecated. Please use torch.distributed.reduce_scatter_tensor instead. req = dist._reduce_scatter_base(
    2024-12-04T05:27:51.138Z
    /opt/conda/lib/python3.11/site-packages/torchrec/distributed/comm_ops.py:2157: FutureWarning: torch.distributed._reduce_scatter_base is a private function and will be deprecated. Please use torch.distributed.reduce_scatter_tensor instead. req = dist._reduce_scatter_base(
    2024-12-04T05:27:51.138Z
    /opt/conda/lib/python3.11/site-packages/torchrec/distributed/comm_ops.py:2157: FutureWarning: torch.distributed._reduce_scatter_base is a private function and will be deprecated. Please use torch.distributed.reduce_scatter_tensor instead. req = dist._reduce_scatter_base(
    2024-12-04T05:27:51.138Z
    /opt/conda/lib/python3.11/site-packages/torchrec/distributed/comm_ops.py:2157: FutureWarning: torch.distributed._reduce_scatter_base is a private function and will be deprecated. Please use torch.distributed.reduce_scatter_tensor instead. req = dist._reduce_scatter_base(
    2024-12-04T05:27:51.138Z
    /opt/conda/lib/python3.11/site-packages/torchrec/distributed/comm_ops.py:2157: FutureWarning: torch.distributed._reduce_scatter_base is a private function and will be deprecated. Please use torch.distributed.reduce_scatter_tensor instead. req = dist._reduce_scatter_base(
    2024-12-04T05:27:51.139Z
    /opt/conda/lib/python3.11/site-packages/torchrec/distributed/comm_ops.py:2157: FutureWarning: torch.distributed._reduce_scatter_base is a private function and will be deprecated. Please use torch.distributed.reduce_scatter_tensor instead. req = dist._reduce_scatter_base(
    2024-12-04T05:27:51.139Z
    /opt/conda/lib/python3.11/site-packages/torchrec/distributed/comm_ops.py:2157: FutureWarning: torch.distributed._reduce_scatter_base is a private function and will be deprecated. Please use torch.distributed.reduce_scatter_tensor instead. req = dist._reduce_scatter_base(
    2024-12-04T05:27:51.139Z
    /opt/conda/lib/python3.11/site-packages/torchrec/distributed/comm_ops.py:2157: FutureWarning: torch.distributed._reduce_scatter_base is a private function and will be deprecated. Please use torch.distributed.reduce_scatter_tensor instead. req = dist._reduce_scatter_base(
    2024-12-04T05:27:52.139Z
    /opt/conda/lib/python3.11/site-packages/torch/distributed/c10d_logger.py:83: FutureWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead. return func(*args, **kwargs)
    2024-12-04T05:27:52.139Z
    /opt/conda/lib/python3.11/site-packages/torch/distributed/c10d_logger.py:83: FutureWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead. return func(*args, **kwargs)
    2024-12-04T05:27:52.139Z
    /opt/conda/lib/python3.11/site-packages/torch/distributed/c10d_logger.py:83: FutureWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead. return func(*args, **kwargs)
    2024-12-04T05:27:52.139Z
    /opt/conda/lib/python3.11/site-packages/torch/distributed/c10d_logger.py:83: FutureWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead. return func(*args, **kwargs)
    2024-12-04T05:27:52.139Z
    /opt/conda/lib/python3.11/site-packages/torch/distributed/c10d_logger.py:83: FutureWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead. return func(*args, **kwargs)
    2024-12-04T05:27:52.139Z
    /opt/conda/lib/python3.11/site-packages/torch/distributed/c10d_logger.py:83: FutureWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead. return func(*args, **kwargs)
    2024-12-04T05:27:52.139Z
    /opt/conda/lib/python3.11/site-packages/torch/distributed/c10d_logger.py:83: FutureWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead. return func(*args, **kwargs)
    2024-12-04T05:27:52.139Z
    /opt/conda/lib/python3.11/site-packages/torch/distributed/c10d_logger.py:83: FutureWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead. return func(*args, **kwargs)
    2024-12-04T05:27:52.139Z
    Epoch 0: 0%| | 1/2820 [00:06<5:14:01, 6.68s/it]
    2024-12-04T05:27:53.140Z
    Epoch 0: 0%| | 2/2820 [00:07<2:19:26, 2.97s/it]
    2024-12-04T05:27:53.140Z
    Epoch 0: 0%| | 3/2820 [00:07<1:22:15, 1.75s/it]
    2024-12-04T05:27:53.140Z
    Epoch 0: 0%| | 4/2820 [00:07<55:43, 1.19s/it]
    2024-12-04T05:27:54.140Z
    Epoch 0: 0%| | 5/2820 [00:07<40:55, 1.15it/s]

  2. Failed case:
    2024-12-04T07:19:22.472Z
    real-time non-blocking time (microseconds, -R) unlimited
    2024-12-04T07:19:22.472Z
    core file size (blocks, -c) unlimited
    2024-12-04T07:19:22.472Z
    data seg size (kbytes, -d) unlimited
    2024-12-04T07:19:22.472Z
    scheduling priority (-e) 0
    2024-12-04T07:19:22.472Z
    file size (blocks, -f) unlimited
    2024-12-04T07:19:22.472Z
    pending signals (-i) 30446
    2024-12-04T07:19:22.472Z
    max locked memory (kbytes, -l) unlimited
    2024-12-04T07:19:22.472Z
    max memory size (kbytes, -m) unlimited
    2024-12-04T07:19:22.472Z
    open files (-n) 65536
    2024-12-04T07:19:22.472Z
    pipe size (512 bytes, -p) 8
    2024-12-04T07:19:22.472Z
    POSIX message queues (bytes, -q) 819200
    2024-12-04T07:19:22.472Z
    real-time priority (-r) 0
    2024-12-04T07:19:22.472Z
    stack size (kbytes, -s) 65536
    2024-12-04T07:19:22.472Z
    cpu time (seconds, -t) unlimited
    2024-12-04T07:19:22.472Z
    max user processes (-u) unlimited
    2024-12-04T07:19:22.472Z
    virtual memory (kbytes, -v) unlimited
    2024-12-04T07:19:22.472Z
    file locks (-x) unlimited
    2024-12-04T07:19:25.473Z
    W1204 07:19:24.745000 18465 site-packages/torch/distributed/run.py:793]
    2024-12-04T07:19:25.473Z
    W1204 07:19:24.745000 18465 site-packages/torch/distributed/run.py:793] *****************************************
    2024-12-04T07:19:25.473Z
    W1204 07:19:24.745000 18465 site-packages/torch/distributed/run.py:793] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
    2024-12-04T07:19:25.473Z
    W1204 07:19:24.745000 18465 site-packages/torch/distributed/run.py:793] *****************************************
    2024-12-04T07:19:41.476Z
    PARAMS: (lr, batch_size, warmup_steps, decay_start, decay_steps): (0.001, 8192, 0, 0, 0)
    2024-12-04T07:19:42.477Z
    number of train_paths: 5760, valid_paths: 240, test_paths: 240
    2024-12-04T07:19:42.477Z
    rank: 6, train_dataloader: 2820, test_dataloader: 165
    2024-12-04T07:19:42.477Z
    number of train_paths: 5760, valid_paths: 240, test_paths: 240
    2024-12-04T07:19:42.477Z
    rank: 0, train_dataloader: 2820, test_dataloader: 165
    2024-12-04T07:19:42.477Z
    [rank6]:[W1204 07:19:42.887363840 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 6] using GPU 6 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
    2024-12-04T07:19:42.477Z
    number of train_paths: 5760, valid_paths: 240, test_paths: 240
    2024-12-04T07:19:42.477Z
    rank: 4, train_dataloader: 2820, test_dataloader: 165
    2024-12-04T07:19:42.477Z
    [rank4]:[W1204 07:19:42.930065756 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 4] using GPU 4 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
    2024-12-04T07:19:42.477Z
    number of train_paths: 5760, valid_paths: 240, test_paths: 240
    2024-12-04T07:19:42.477Z
    rank: 3, train_dataloader: 2820, test_dataloader: 165
    2024-12-04T07:19:42.477Z
    [rank3]:[W1204 07:19:42.007794791 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 3] using GPU 3 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
    2024-12-04T07:19:42.477Z
    number of train_paths: 5760, valid_paths: 240, test_paths: 240
    2024-12-04T07:19:42.477Z
    rank: 2, train_dataloader: 2820, test_dataloader: 165
    2024-12-04T07:19:42.477Z
    [rank2]:[W1204 07:19:42.056318302 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 2] using GPU 2 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
    2024-12-04T07:19:42.477Z
    number of train_paths: 5760, valid_paths: 240, test_paths: 240
    2024-12-04T07:19:42.477Z
    rank: 1, train_dataloader: 2820, test_dataloader: 165
    2024-12-04T07:19:42.477Z
    number of train_paths: 5760, valid_paths: 240, test_paths: 240
    2024-12-04T07:19:42.477Z
    rank: 7, train_dataloader: 2820, test_dataloader: 165
    2024-12-04T07:19:42.477Z
    [rank1]:[W1204 07:19:42.077628907 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 1] using GPU 1 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
    2024-12-04T07:19:42.477Z
    [rank7]:[W1204 07:19:42.077628931 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 7] using GPU 7 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
    2024-12-04T07:19:42.477Z
    number of train_paths: 5760, valid_paths: 240, test_paths: 240
    2024-12-04T07:19:42.477Z
    rank: 5, train_dataloader: 2820, test_dataloader: 165
    2024-12-04T07:19:42.477Z
    [rank5]:[W1204 07:19:42.088912807 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 5] using GPU 5 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
    2024-12-04T07:19:42.477Z
    [rank0]:[W1204 07:19:42.109306917 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 0] using GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
    2024-12-04T07:19:42.477Z
    algo-1:18562:18562 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
    2024-12-04T07:19:42.477Z
    algo-1:18562:18562 [0] NCCL INFO Bootstrap : Using eth0:10.0.95.160<0>
    2024-12-04T07:19:43.477Z
    algo-1:18562:18562 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v8 symbol.
    2024-12-04T07:19:43.477Z
    algo-1:18562:18562 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
    2024-12-04T07:19:43.477Z
    algo-1:18562:18562 [0] NCCL INFO cudaDriverVersion 12040
    2024-12-04T07:19:43.477Z
    NCCL version 2.21.5+cuda12.4
    2024-12-04T07:19:43.477Z
    algo-1:18563:18563 [1] NCCL INFO cudaDriverVersion 12040
    2024-12-04T07:19:43.477Z
    algo-1:18563:18563 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
    2024-12-04T07:19:43.477Z
    algo-1:18563:18563 [1] NCCL INFO Bootstrap : Using eth0:10.0.95.160<0>
    2024-12-04T07:19:43.477Z
    algo-1:18563:18563 [1] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v8 symbol.
    2024-12-04T07:19:43.477Z
    algo-1:18563:18563 [1] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
    2024-12-04T07:19:43.478Z
    algo-1:18565:18565 [3] NCCL INFO cudaDriverVersion 12040
    2024-12-04T07:19:43.478Z
    algo-1:18565:18565 [3] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
    2024-12-04T07:19:43.478Z
    algo-1:18565:18565 [3] NCCL INFO Bootstrap : Using eth0:10.0.95.160<0>
    2024-12-04T07:19:43.478Z
    algo-1:18564:18564 [2] NCCL INFO cudaDriverVersion 12040
    2024-12-04T07:19:43.478Z
    algo-1:18564:18564 [2] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
    2024-12-04T07:19:43.478Z
    algo-1:18564:18564 [2] NCCL INFO Bootstrap : Using eth0:10.0.95.160<0>
    2024-12-04T07:19:43.478Z
    algo-1:18565:18565 [3] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v8 symbol.
    2024-12-04T07:19:43.478Z
    algo-1:18565:18565 [3] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
    2024-12-04T07:19:43.478Z
    algo-1:18564:18564 [2] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v8 symbol.
    2024-12-04T07:19:43.478Z
    algo-1:18564:18564 [2] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
    2024-12-04T07:19:43.478Z
    algo-1:18566:18566 [4] NCCL INFO cudaDriverVersion 12040
    2024-12-04T07:19:43.478Z
    algo-1:18566:18566 [4] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
    2024-12-04T07:19:43.478Z
    algo-1:18566:18566 [4] NCCL INFO Bootstrap : Using eth0:10.0.95.160<0>
    2024-12-04T07:19:43.478Z
    algo-1:18566:18566 [4] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v8 symbol.
    2024-12-04T07:19:43.478Z
    algo-1:18566:18566 [4] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
    2024-12-04T07:19:43.478Z
    algo-1:18568:18568 [6] NCCL INFO cudaDriverVersion 12040
    2024-12-04T07:19:43.478Z
    algo-1:18568:18568 [6] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
    2024-12-04T07:19:43.478Z
    algo-1:18568:18568 [6] NCCL INFO Bootstrap : Using eth0:10.0.95.160<0>
    2024-12-04T07:19:43.478Z
    algo-1:18569:18569 [7] NCCL INFO cudaDriverVersion 12040
    2024-12-04T07:19:43.478Z
    algo-1:18569:18569 [7] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
    2024-12-04T07:19:43.478Z
    algo-1:18569:18569 [7] NCCL INFO Bootstrap : Using eth0:10.0.95.160<0>
    2024-12-04T07:19:43.478Z
    algo-1:18568:18568 [6] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v8 symbol.
    2024-12-04T07:19:43.478Z
    algo-1:18568:18568 [6] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
    2024-12-04T07:19:43.478Z
    algo-1:18567:18567 [5] NCCL INFO cudaDriverVersion 12040
    2024-12-04T07:19:43.478Z
    algo-1:18567:18567 [5] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
    2024-12-04T07:19:43.478Z
    algo-1:18567:18567 [5] NCCL INFO Bootstrap : Using eth0:10.0.95.160<0>
    2024-12-04T07:19:43.478Z
    algo-1:18569:18569 [7] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v8 symbol.
    2024-12-04T07:19:43.478Z
    algo-1:18569:18569 [7] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
    2024-12-04T07:19:43.478Z
    algo-1:18567:18567 [5] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v8 symbol.
    2024-12-04T07:19:43.478Z
    algo-1:18567:18567 [5] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
    2024-12-04T07:19:43.478Z
    algo-1:18562:18635 [0] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.11.0-aws
    2024-12-04T07:19:43.478Z
    algo-1:18562:18635 [0] NCCL INFO NET/OFI Using Libfabric version 1.22
    2024-12-04T07:19:43.478Z
    algo-1:18562:18635 [0] NCCL INFO NET/OFI Using CUDA driver version 12040
    2024-12-04T07:19:43.478Z
    algo-1:18562:18635 [0] NCCL INFO NET/OFI Configuring AWS-specific options
    2024-12-04T07:19:43.478Z
    algo-1:18562:18635 [0] NCCL INFO NET/OFI Setting provider_filter to efa
    2024-12-04T07:19:43.478Z
    algo-1:18562:18635 [0] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
    2024-12-04T07:19:43.478Z
    algo-1:18562:18635 [0] NCCL INFO NET/OFI Setting NCCL_NVLSTREE_MAX_CHUNKSIZE to 512KiB
    2024-12-04T07:19:43.478Z
    algo-1:18562:18635 [0] NCCL INFO NET/OFI Setting NCCL_NVLS_CHUNKSIZE to 512KiB
    2024-12-04T07:19:43.478Z
    algo-1:18562:18635 [0] NCCL INFO NET/OFI Running on p4d.24xlarge platform, Setting NCCL_TOPO_FILE environment variable to /usr/local/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
    2024-12-04T07:19:43.478Z
    algo-1:18562:18635 [0] NCCL INFO NET/OFI Internode latency set at 75.0 us
    2024-12-04T07:19:43.478Z
    algo-1:18562:18635 [0] NCCL INFO NET/OFI Creating one domain per process
    2024-12-04T07:19:43.478Z
    algo-1:18562:18635 [0] NCCL INFO NET/OFI Using transport protocol SENDRECV
    2024-12-04T07:19:43.478Z
    algo-1:18563:18636 [1] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.11.0-aws
    2024-12-04T07:19:43.478Z
    algo-1:18563:18636 [1] NCCL INFO NET/OFI Using Libfabric version 1.22
    2024-12-04T07:19:43.478Z
    algo-1:18563:18636 [1] NCCL INFO NET/OFI Using CUDA driver version 12040
    2024-12-04T07:19:43.478Z
    algo-1:18563:18636 [1] NCCL INFO NET/OFI Configuring AWS-specific options
    2024-12-04T07:19:43.478Z
    algo-1:18563:18636 [1] NCCL INFO NET/OFI Setting provider_filter to efa
    2024-12-04T07:19:43.478Z
    algo-1:18563:18636 [1] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
    2024-12-04T07:19:43.478Z
    algo-1:18563:18636 [1] NCCL INFO NET/OFI Setting NCCL_NVLSTREE_MAX_CHUNKSIZE to 512KiB
    2024-12-04T07:19:43.478Z
    algo-1:18563:18636 [1] NCCL INFO NET/OFI Setting NCCL_NVLS_CHUNKSIZE to 512KiB
    2024-12-04T07:19:43.478Z
    algo-1:18563:18636 [1] NCCL INFO NET/OFI Running on p4d.24xlarge platform, Setting NCCL_TOPO_FILE environment variable to /usr/local/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
    2024-12-04T07:19:43.478Z
    algo-1:18563:18636 [1] NCCL INFO NET/OFI Internode latency set at 75.0 us
    2024-12-04T07:19:43.478Z
    algo-1:18563:18636 [1] NCCL INFO NET/OFI Creating one domain per process
    2024-12-04T07:19:43.478Z
    algo-1:18563:18636 [1] NCCL INFO NET/OFI Using transport protocol SENDRECV
    2024-12-04T07:19:43.478Z
    algo-1:18562:18635 [0] NCCL INFO NET/OFI Selected Provider is efa (found 4 nics)
    2024-12-04T07:19:43.478Z
    algo-1:18564:18638 [2] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.11.0-aws
    2024-12-04T07:19:43.478Z
    algo-1:18564:18638 [2] NCCL INFO NET/OFI Using Libfabric version 1.22
    2024-12-04T07:19:43.478Z
    algo-1:18564:18638 [2] NCCL INFO NET/OFI Using CUDA driver version 12040
    2024-12-04T07:19:43.478Z
    algo-1:18564:18638 [2] NCCL INFO NET/OFI Configuring AWS-specific options
    2024-12-04T07:19:43.479Z
    algo-1:18564:18638 [2] NCCL INFO NET/OFI Setting provider_filter to efa
    2024-12-04T07:19:43.479Z
    algo-1:18564:18638 [2] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
    2024-12-04T07:19:43.479Z
    algo-1:18564:18638 [2] NCCL INFO NET/OFI Setting NCCL_NVLSTREE_MAX_CHUNKSIZE to 512KiB
    2024-12-04T07:19:43.479Z
    algo-1:18564:18638 [2] NCCL INFO NET/OFI Setting NCCL_NVLS_CHUNKSIZE to 512KiB
    2024-12-04T07:19:43.479Z
    algo-1:18564:18638 [2] NCCL INFO NET/OFI Running on p4d.24xlarge platform, Setting NCCL_TOPO_FILE environment variable to /usr/local/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
    2024-12-04T07:19:43.479Z
    algo-1:18564:18638 [2] NCCL INFO NET/OFI Internode latency set at 75.0 us
    2024-12-04T07:19:43.479Z
    algo-1:18564:18638 [2] NCCL INFO NET/OFI Creating one domain per process
    2024-12-04T07:19:43.479Z
    algo-1:18564:18638 [2] NCCL INFO NET/OFI Using transport protocol SENDRECV
    2024-12-04T07:19:43.479Z
    algo-1:18565:18637 [3] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.11.0-aws
    2024-12-04T07:19:43.479Z
    algo-1:18565:18637 [3] NCCL INFO NET/OFI Using Libfabric version 1.22
    2024-12-04T07:19:43.479Z
    algo-1:18565:18637 [3] NCCL INFO NET/OFI Using CUDA driver version 12040
    2024-12-04T07:19:43.479Z
    algo-1:18565:18637 [3] NCCL INFO NET/OFI Configuring AWS-specific options
    2024-12-04T07:19:43.479Z
    algo-1:18565:18637 [3] NCCL INFO NET/OFI Setting provider_filter to efa
    2024-12-04T07:19:43.479Z
    algo-1:18565:18637 [3] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
    2024-12-04T07:19:43.479Z
    algo-1:18565:18637 [3] NCCL INFO NET/OFI Setting NCCL_NVLSTREE_MAX_CHUNKSIZE to 512KiB
    2024-12-04T07:19:43.479Z
    algo-1:18565:18637 [3] NCCL INFO NET/OFI Setting NCCL_NVLS_CHUNKSIZE to 512KiB
    2024-12-04T07:19:43.479Z
    algo-1:18565:18637 [3] NCCL INFO NET/OFI Running on p4d.24xlarge platform, Setting NCCL_TOPO_FILE environment variable to /usr/local/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
    2024-12-04T07:19:43.479Z
    algo-1:18565:18637 [3] NCCL INFO NET/OFI Internode latency set at 75.0 us
    2024-12-04T07:19:43.479Z
    algo-1:18565:18637 [3] NCCL INFO NET/OFI Creating one domain per process
    2024-12-04T07:19:43.479Z
    algo-1:18565:18637 [3] NCCL INFO NET/OFI Using transport protocol SENDRECV
    2024-12-04T07:19:43.479Z
    libfabric:18562:1733296782::efa:ep_ctrl:efa_rdm_ep_set_shared_memory_permitted():1410 FI_OPT_SHARED_MEMORY_PERMITTED set to false
    2024-12-04T07:19:43.479Z
    algo-1:18562:18635 [0] NCCL INFO NET/OFI Setting FI_OPT_EFA_SENDRECV_IN_ORDER_ALIGNED_128_BYTES not supported.
    2024-12-04T07:19:43.479Z
    algo-1:18562:18635 [0] NCCL INFO NET/OFI Setting NCCL_PROTO to "simple"
    2024-12-04T07:19:43.479Z
    algo-1:18566:18639 [4] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.11.0-aws
    2024-12-04T07:19:43.479Z
    algo-1:18566:18639 [4] NCCL INFO NET/OFI Using Libfabric version 1.22
    2024-12-04T07:19:43.479Z
    algo-1:18566:18639 [4] NCCL INFO NET/OFI Using CUDA driver version 12040
    2024-12-04T07:19:43.479Z
    algo-1:18566:18639 [4] NCCL INFO NET/OFI Configuring AWS-specific options
    2024-12-04T07:19:43.479Z
    algo-1:18566:18639 [4] NCCL INFO NET/OFI Setting provider_filter to efa
    2024-12-04T07:19:43.479Z
    algo-1:18566:18639 [4] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
    2024-12-04T07:19:43.479Z
    algo-1:18566:18639 [4] NCCL INFO NET/OFI Setting NCCL_NVLSTREE_MAX_CHUNKSIZE to 512KiB
    2024-12-04T07:19:43.479Z
    algo-1:18566:18639 [4] NCCL INFO NET/OFI Setting NCCL_NVLS_CHUNKSIZE to 512KiB
    2024-12-04T07:19:43.479Z
    algo-1:18566:18639 [4] NCCL INFO NET/OFI Running on p4d.24xlarge platform, Setting NCCL_TOPO_FILE environment variable to /usr/local/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
    2024-12-04T07:19:43.479Z
    algo-1:18566:18639 [4] NCCL INFO NET/OFI Internode latency set at 75.0 us
    2024-12-04T07:19:43.479Z
    algo-1:18566:18639 [4] NCCL INFO NET/OFI Creating one domain per process
    2024-12-04T07:19:43.479Z
    algo-1:18566:18639 [4] NCCL INFO NET/OFI Using transport protocol SENDRECV
    2024-12-04T07:19:43.479Z
    algo-1:18562:18635 [0] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T07:19:43.479Z
    algo-1:18562:18635 [0] NCCL INFO Using non-device net plugin version 0
    2024-12-04T07:19:43.479Z
    algo-1:18562:18635 [0] NCCL INFO Using network AWS Libfabric
    2024-12-04T07:19:43.479Z
    algo-1:18568:18640 [6] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.11.0-aws
    2024-12-04T07:19:43.479Z
    algo-1:18568:18640 [6] NCCL INFO NET/OFI Using Libfabric version 1.22
    2024-12-04T07:19:43.479Z
    algo-1:18568:18640 [6] NCCL INFO NET/OFI Using CUDA driver version 12040
    2024-12-04T07:19:43.479Z
    algo-1:18568:18640 [6] NCCL INFO NET/OFI Configuring AWS-specific options
    2024-12-04T07:19:43.479Z
    algo-1:18568:18640 [6] NCCL INFO NET/OFI Setting provider_filter to efa
    2024-12-04T07:19:43.479Z
    algo-1:18568:18640 [6] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
    2024-12-04T07:19:43.479Z
    algo-1:18568:18640 [6] NCCL INFO NET/OFI Setting NCCL_NVLSTREE_MAX_CHUNKSIZE to 512KiB
    2024-12-04T07:19:43.479Z
    algo-1:18568:18640 [6] NCCL INFO NET/OFI Setting NCCL_NVLS_CHUNKSIZE to 512KiB
    2024-12-04T07:19:43.479Z
    algo-1:18568:18640 [6] NCCL INFO NET/OFI Running on p4d.24xlarge platform, Setting NCCL_TOPO_FILE environment variable to /usr/local/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
    2024-12-04T07:19:43.479Z
    algo-1:18568:18640 [6] NCCL INFO NET/OFI Internode latency set at 75.0 us
    2024-12-04T07:19:43.479Z
    algo-1:18568:18640 [6] NCCL INFO NET/OFI Creating one domain per process
    2024-12-04T07:19:43.479Z
    algo-1:18568:18640 [6] NCCL INFO NET/OFI Using transport protocol SENDRECV
    2024-12-04T07:19:43.479Z
    algo-1:18567:18642 [5] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.11.0-aws
    2024-12-04T07:19:43.479Z
    algo-1:18567:18642 [5] NCCL INFO NET/OFI Using Libfabric version 1.22
    2024-12-04T07:19:43.479Z
    algo-1:18567:18642 [5] NCCL INFO NET/OFI Using CUDA driver version 12040
    2024-12-04T07:19:43.479Z
    algo-1:18567:18642 [5] NCCL INFO NET/OFI Configuring AWS-specific options
    2024-12-04T07:19:43.479Z
    algo-1:18567:18642 [5] NCCL INFO NET/OFI Setting provider_filter to efa
    2024-12-04T07:19:43.479Z
    algo-1:18567:18642 [5] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
    2024-12-04T07:19:43.480Z
    algo-1:18567:18642 [5] NCCL INFO NET/OFI Setting NCCL_NVLSTREE_MAX_CHUNKSIZE to 512KiB
    2024-12-04T07:19:43.480Z
    algo-1:18567:18642 [5] NCCL INFO NET/OFI Setting NCCL_NVLS_CHUNKSIZE to 512KiB
    2024-12-04T07:19:43.480Z
    algo-1:18567:18642 [5] NCCL INFO NET/OFI Running on p4d.24xlarge platform, Setting NCCL_TOPO_FILE environment variable to /usr/local/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
    2024-12-04T07:19:43.480Z
    algo-1:18567:18642 [5] NCCL INFO NET/OFI Internode latency set at 75.0 us
    2024-12-04T07:19:43.480Z
    algo-1:18567:18642 [5] NCCL INFO NET/OFI Creating one domain per process
    2024-12-04T07:19:43.480Z
    algo-1:18567:18642 [5] NCCL INFO NET/OFI Using transport protocol SENDRECV
    2024-12-04T07:19:43.480Z
    algo-1:18563:18636 [1] NCCL INFO NET/OFI Selected Provider is efa (found 4 nics)
    2024-12-04T07:19:43.480Z
    algo-1:18569:18641 [7] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.11.0-aws
    2024-12-04T07:19:43.480Z
    algo-1:18569:18641 [7] NCCL INFO NET/OFI Using Libfabric version 1.22
    2024-12-04T07:19:43.480Z
    algo-1:18569:18641 [7] NCCL INFO NET/OFI Using CUDA driver version 12040
    2024-12-04T07:19:43.480Z
    algo-1:18569:18641 [7] NCCL INFO NET/OFI Configuring AWS-specific options
    2024-12-04T07:19:43.480Z
    algo-1:18569:18641 [7] NCCL INFO NET/OFI Setting provider_filter to efa
    2024-12-04T07:19:43.480Z
    algo-1:18569:18641 [7] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
    2024-12-04T07:19:43.480Z
    algo-1:18569:18641 [7] NCCL INFO NET/OFI Setting NCCL_NVLSTREE_MAX_CHUNKSIZE to 512KiB
    2024-12-04T07:19:43.480Z
    algo-1:18569:18641 [7] NCCL INFO NET/OFI Setting NCCL_NVLS_CHUNKSIZE to 512KiB
    2024-12-04T07:19:43.480Z
    algo-1:18569:18641 [7] NCCL INFO NET/OFI Running on p4d.24xlarge platform, Setting NCCL_TOPO_FILE environment variable to /usr/local/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
    2024-12-04T07:19:43.480Z
    algo-1:18569:18641 [7] NCCL INFO NET/OFI Internode latency set at 75.0 us
    2024-12-04T07:19:43.480Z
    algo-1:18569:18641 [7] NCCL INFO NET/OFI Creating one domain per process
    2024-12-04T07:19:43.480Z
    algo-1:18569:18641 [7] NCCL INFO NET/OFI Using transport protocol SENDRECV
    2024-12-04T07:19:43.480Z
    algo-1:18564:18638 [2] NCCL INFO NET/OFI Selected Provider is efa (found 4 nics)
    2024-12-04T07:19:43.480Z
    algo-1:18565:18637 [3] NCCL INFO NET/OFI Selected Provider is efa (found 4 nics)
    2024-12-04T07:19:43.480Z
    libfabric:18563:1733296782::efa:ep_ctrl:efa_rdm_ep_set_shared_memory_permitted():1410 FI_OPT_SHARED_MEMORY_PERMITTED set to false
    2024-12-04T07:19:43.480Z
    algo-1:18563:18636 [1] NCCL INFO NET/OFI Setting FI_OPT_EFA_SENDRECV_IN_ORDER_ALIGNED_128_BYTES not supported.
    2024-12-04T07:19:43.480Z
    algo-1:18563:18636 [1] NCCL INFO NET/OFI Setting NCCL_PROTO to "simple"
    2024-12-04T07:19:43.480Z
    algo-1:18563:18636 [1] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T07:19:43.480Z
    algo-1:18563:18636 [1] NCCL INFO Using non-device net plugin version 0
    2024-12-04T07:19:43.480Z
    algo-1:18563:18636 [1] NCCL INFO Using network AWS Libfabric
    2024-12-04T07:19:43.480Z
    algo-1:18566:18639 [4] NCCL INFO NET/OFI Selected Provider is efa (found 4 nics)
    2024-12-04T07:19:43.480Z
    algo-1:18568:18640 [6] NCCL INFO NET/OFI Selected Provider is efa (found 4 nics)
    2024-12-04T07:19:43.480Z
    algo-1:18567:18642 [5] NCCL INFO NET/OFI Selected Provider is efa (found 4 nics)
    2024-12-04T07:19:43.480Z
    libfabric:18564:1733296782::efa:ep_ctrl:efa_rdm_ep_set_shared_memory_permitted():1410 FI_OPT_SHARED_MEMORY_PERMITTED set to false
    2024-12-04T07:19:43.480Z
    algo-1:18564:18638 [2] NCCL INFO NET/OFI Setting FI_OPT_EFA_SENDRECV_IN_ORDER_ALIGNED_128_BYTES not supported.
    2024-12-04T07:19:43.480Z
    algo-1:18564:18638 [2] NCCL INFO NET/OFI Setting NCCL_PROTO to "simple"
    2024-12-04T07:19:43.480Z
    algo-1:18564:18638 [2] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T07:19:43.480Z
    algo-1:18564:18638 [2] NCCL INFO Using non-device net plugin version 0
    2024-12-04T07:19:43.480Z
    algo-1:18564:18638 [2] NCCL INFO Using network AWS Libfabric
    2024-12-04T07:19:43.480Z
    algo-1:18569:18641 [7] NCCL INFO NET/OFI Selected Provider is efa (found 4 nics)
    2024-12-04T07:19:43.480Z
    libfabric:18565:1733296782::efa:ep_ctrl:efa_rdm_ep_set_shared_memory_permitted():1410 FI_OPT_SHARED_MEMORY_PERMITTED set to false
    2024-12-04T07:19:43.480Z
    algo-1:18565:18637 [3] NCCL INFO NET/OFI Setting FI_OPT_EFA_SENDRECV_IN_ORDER_ALIGNED_128_BYTES not supported.
    2024-12-04T07:19:43.480Z
    algo-1:18565:18637 [3] NCCL INFO NET/OFI Setting NCCL_PROTO to "simple"
    2024-12-04T07:19:43.480Z
    algo-1:18565:18637 [3] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T07:19:43.480Z
    algo-1:18565:18637 [3] NCCL INFO Using non-device net plugin version 0
    2024-12-04T07:19:43.480Z
    algo-1:18565:18637 [3] NCCL INFO Using network AWS Libfabric
    2024-12-04T07:19:43.480Z
    libfabric:18566:1733296782::efa:ep_ctrl:efa_rdm_ep_set_shared_memory_permitted():1410 FI_OPT_SHARED_MEMORY_PERMITTED set to false
    2024-12-04T07:19:43.480Z
    algo-1:18566:18639 [4] NCCL INFO NET/OFI Setting FI_OPT_EFA_SENDRECV_IN_ORDER_ALIGNED_128_BYTES not supported.
    2024-12-04T07:19:43.480Z
    algo-1:18566:18639 [4] NCCL INFO NET/OFI Setting NCCL_PROTO to "simple"
    2024-12-04T07:19:43.480Z
    algo-1:18566:18639 [4] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T07:19:43.480Z
    algo-1:18566:18639 [4] NCCL INFO Using non-device net plugin version 0
    2024-12-04T07:19:43.480Z
    algo-1:18566:18639 [4] NCCL INFO Using network AWS Libfabric
    2024-12-04T07:19:43.480Z
    algo-1:18562:18635 [0] NCCL INFO DMA-BUF is available on GPU device 0
    2024-12-04T07:19:43.480Z
    libfabric:18567:1733296782::efa:ep_ctrl:efa_rdm_ep_set_shared_memory_permitted():1410 FI_OPT_SHARED_MEMORY_PERMITTED set to false
    2024-12-04T07:19:43.480Z
    algo-1:18567:18642 [5] NCCL INFO NET/OFI Setting FI_OPT_EFA_SENDRECV_IN_ORDER_ALIGNED_128_BYTES not supported.
    2024-12-04T07:19:43.480Z
    algo-1:18567:18642 [5] NCCL INFO NET/OFI Setting NCCL_PROTO to "simple"
    2024-12-04T07:19:43.480Z
    algo-1:18567:18642 [5] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T07:19:43.480Z
    algo-1:18567:18642 [5] NCCL INFO Using non-device net plugin version 0
    2024-12-04T07:19:43.480Z
    algo-1:18567:18642 [5] NCCL INFO Using network AWS Libfabric
    2024-12-04T07:19:43.480Z
    libfabric:18569:1733296782::efa:ep_ctrl:efa_rdm_ep_set_shared_memory_permitted():1410 FI_OPT_SHARED_MEMORY_PERMITTED set to false
    2024-12-04T07:19:43.480Z
    algo-1:18569:18641 [7] NCCL INFO NET/OFI Setting FI_OPT_EFA_SENDRECV_IN_ORDER_ALIGNED_128_BYTES not supported.
    2024-12-04T07:19:43.480Z
    algo-1:18569:18641 [7] NCCL INFO NET/OFI Setting NCCL_PROTO to "simple"
    2024-12-04T07:19:43.480Z
    algo-1:18569:18641 [7] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T07:19:43.481Z
    algo-1:18569:18641 [7] NCCL INFO Using non-device net plugin version 0
    2024-12-04T07:19:43.481Z
    algo-1:18569:18641 [7] NCCL INFO Using network AWS Libfabric
    2024-12-04T07:19:43.481Z
    libfabric:18568:1733296782::efa:ep_ctrl:efa_rdm_ep_set_shared_memory_permitted():1410 FI_OPT_SHARED_MEMORY_PERMITTED set to false
    2024-12-04T07:19:43.481Z
    algo-1:18568:18640 [6] NCCL INFO NET/OFI Setting FI_OPT_EFA_SENDRECV_IN_ORDER_ALIGNED_128_BYTES not supported.
    2024-12-04T07:19:43.481Z
    algo-1:18568:18640 [6] NCCL INFO NET/OFI Setting NCCL_PROTO to "simple"
    2024-12-04T07:19:43.481Z
    algo-1:18568:18640 [6] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T07:19:43.481Z
    algo-1:18568:18640 [6] NCCL INFO Using non-device net plugin version 0
    2024-12-04T07:19:43.481Z
    algo-1:18568:18640 [6] NCCL INFO Using network AWS Libfabric
    2024-12-04T07:19:43.481Z
    algo-1:18563:18636 [1] NCCL INFO DMA-BUF is available on GPU device 1
    2024-12-04T07:19:43.481Z
    algo-1:18564:18638 [2] NCCL INFO DMA-BUF is available on GPU device 2
    2024-12-04T07:19:43.481Z
    algo-1:18565:18637 [3] NCCL INFO DMA-BUF is available on GPU device 3
    2024-12-04T07:19:43.481Z
    algo-1:18566:18639 [4] NCCL INFO DMA-BUF is available on GPU device 4
    2024-12-04T07:19:43.481Z
    algo-1:18567:18642 [5] NCCL INFO DMA-BUF is available on GPU device 5
    2024-12-04T07:19:43.481Z
    algo-1:18569:18641 [7] NCCL INFO DMA-BUF is available on GPU device 7
    2024-12-04T07:19:43.481Z
    algo-1:18568:18640 [6] NCCL INFO DMA-BUF is available on GPU device 6
    2024-12-04T07:19:44.481Z
    algo-1:18566:18639 [4] NCCL INFO ncclCommInitRank comm 0x55bbe3155fb0 rank 4 nranks 16 cudaDev 4 nvmlDev 4 busId 901c0 commId 0xd6fe8029bfb252f0 - Init START
    2024-12-04T07:19:44.481Z
    algo-1:18569:18641 [7] NCCL INFO ncclCommInitRank comm 0x5600ab614030 rank 7 nranks 16 cudaDev 7 nvmlDev 7 busId a01d0 commId 0xd6fe8029bfb252f0 - Init START
    2024-12-04T07:19:44.481Z
    algo-1:18568:18640 [6] NCCL INFO ncclCommInitRank comm 0x55fd3e248290 rank 6 nranks 16 cudaDev 6 nvmlDev 6 busId a01c0 commId 0xd6fe8029bfb252f0 - Init START
    2024-12-04T07:19:44.481Z
    algo-1:18567:18642 [5] NCCL INFO ncclCommInitRank comm 0x560b96d74770 rank 5 nranks 16 cudaDev 5 nvmlDev 5 busId 901d0 commId 0xd6fe8029bfb252f0 - Init START
    2024-12-04T07:19:44.481Z
    algo-1:18565:18637 [3] NCCL INFO ncclCommInitRank comm 0x556e87b74910 rank 3 nranks 16 cudaDev 3 nvmlDev 3 busId 201d0 commId 0xd6fe8029bfb252f0 - Init START
    2024-12-04T07:19:44.481Z
    algo-1:18562:18635 [0] NCCL INFO ncclCommInitRank comm 0x56106ff88e80 rank 0 nranks 16 cudaDev 0 nvmlDev 0 busId 101c0 commId 0xd6fe8029bfb252f0 - Init START
    2024-12-04T07:19:44.481Z
    algo-1:18564:18638 [2] NCCL INFO ncclCommInitRank comm 0x55bee3a05970 rank 2 nranks 16 cudaDev 2 nvmlDev 2 busId 201c0 commId 0xd6fe8029bfb252f0 - Init START
    2024-12-04T07:19:44.481Z
    algo-1:18563:18636 [1] NCCL INFO ncclCommInitRank comm 0x55c706a39990 rank 1 nranks 16 cudaDev 1 nvmlDev 1 busId 101d0 commId 0xd6fe8029bfb252f0 - Init START
    2024-12-04T07:19:44.481Z
    algo-1:18566:18639 [4] NCCL INFO NCCL_TOPO_FILE set by environment to /usr/local/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
    2024-12-04T07:19:44.481Z
    algo-1:18568:18640 [6] NCCL INFO NCCL_TOPO_FILE set by environment to /usr/local/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
    2024-12-04T07:19:44.481Z
    algo-1:18569:18641 [7] NCCL INFO NCCL_TOPO_FILE set by environment to /usr/local/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
    2024-12-04T07:19:44.481Z
    algo-1:18562:18635 [0] NCCL INFO NCCL_TOPO_FILE set by environment to /usr/local/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
    2024-12-04T07:19:44.481Z
    algo-1:18563:18636 [1] NCCL INFO NCCL_TOPO_FILE set by environment to /usr/local/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
    2024-12-04T07:19:44.481Z
    algo-1:18567:18642 [5] NCCL INFO NCCL_TOPO_FILE set by environment to /usr/local/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
    2024-12-04T07:19:44.481Z
    algo-1:18564:18638 [2] NCCL INFO NCCL_TOPO_FILE set by environment to /usr/local/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
    2024-12-04T07:19:44.481Z
    algo-1:18565:18637 [3] NCCL INFO NCCL_TOPO_FILE set by environment to /usr/local/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
    2024-12-04T07:19:45.482Z
    algo-1:18562:18635 [0] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T07:19:45.482Z
    algo-1:18562:18635 [0] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T07:19:45.482Z
    algo-1:18562:18635 [0] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T07:19:45.482Z
    algo-1:18562:18635 [0] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T07:19:45.482Z
    algo-1:18563:18636 [1] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T07:19:45.482Z
    algo-1:18563:18636 [1] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T07:19:45.482Z
    algo-1:18563:18636 [1] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T07:19:45.482Z
    algo-1:18563:18636 [1] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T07:19:45.482Z
    algo-1:18568:18640 [6] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T07:19:45.482Z
    algo-1:18568:18640 [6] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T07:19:45.482Z
    algo-1:18568:18640 [6] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T07:19:45.482Z
    algo-1:18568:18640 [6] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T07:19:45.482Z
    algo-1:18564:18638 [2] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T07:19:45.482Z
    algo-1:18564:18638 [2] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T07:19:45.482Z
    algo-1:18564:18638 [2] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T07:19:45.482Z
    algo-1:18564:18638 [2] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T07:19:45.482Z
    algo-1:18567:18642 [5] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T07:19:45.482Z
    algo-1:18567:18642 [5] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T07:19:45.482Z
    algo-1:18567:18642 [5] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T07:19:45.482Z
    algo-1:18567:18642 [5] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T07:19:45.482Z
    algo-1:18569:18641 [7] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T07:19:45.482Z
    algo-1:18569:18641 [7] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T07:19:45.482Z
    algo-1:18569:18641 [7] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T07:19:45.482Z
    algo-1:18569:18641 [7] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T07:19:45.482Z
    algo-1:18566:18639 [4] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T07:19:45.482Z
    algo-1:18566:18639 [4] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T07:19:45.482Z
    algo-1:18566:18639 [4] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T07:19:45.482Z
    algo-1:18566:18639 [4] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T07:19:45.482Z
    algo-1:18565:18637 [3] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T07:19:45.482Z
    algo-1:18565:18637 [3] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T07:19:45.482Z
    algo-1:18562:18635 [0] NCCL INFO Setting affinity for GPU 0 to ff,ffff0000,00ffffff
    2024-12-04T07:19:45.482Z
    algo-1:18565:18637 [3] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T07:19:45.482Z
    algo-1:18562:18635 [0] NCCL INFO NVLS multicast support is not available on dev 0
    2024-12-04T07:19:45.482Z
    algo-1:18565:18637 [3] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T07:19:45.482Z
    algo-1:18568:18640 [6] NCCL INFO Setting affinity for GPU 6 to ffffff00,0000ffff,ff000000
    2024-12-04T07:19:45.482Z
    algo-1:18568:18640 [6] NCCL INFO NVLS multicast support is not available on dev 6
    2024-12-04T07:19:45.482Z
    algo-1:18563:18636 [1] NCCL INFO Setting affinity for GPU 1 to ff,ffff0000,00ffffff
    2024-12-04T07:19:45.482Z
    algo-1:18563:18636 [1] NCCL INFO NVLS multicast support is not available on dev 1
    2024-12-04T07:19:45.482Z
    algo-1:18567:18642 [5] NCCL INFO Setting affinity for GPU 5 to ffffff00,0000ffff,ff000000
    2024-12-04T07:19:45.482Z
    algo-1:18567:18642 [5] NCCL INFO NVLS multicast support is not available on dev 5
    2024-12-04T07:19:45.482Z
    algo-1:18569:18641 [7] NCCL INFO Setting affinity for GPU 7 to ffffff00,0000ffff,ff000000
    2024-12-04T07:19:45.482Z
    algo-1:18569:18641 [7] NCCL INFO NVLS multicast support is not available on dev 7
    2024-12-04T07:19:45.482Z
    algo-1:18564:18638 [2] NCCL INFO Setting affinity for GPU 2 to ff,ffff0000,00ffffff
    2024-12-04T07:19:45.482Z
    algo-1:18566:18639 [4] NCCL INFO Setting affinity for GPU 4 to ffffff00,0000ffff,ff000000
    2024-12-04T07:19:45.482Z
    algo-1:18566:18639 [4] NCCL INFO NVLS multicast support is not available on dev 4
    2024-12-04T07:19:45.482Z
    algo-1:18564:18638 [2] NCCL INFO NVLS multicast support is not available on dev 2
    2024-12-04T07:19:45.482Z
    algo-1:18565:18637 [3] NCCL INFO Setting affinity for GPU 3 to ff,ffff0000,00ffffff
    2024-12-04T07:19:45.482Z
    algo-1:18565:18637 [3] NCCL INFO NVLS multicast support is not available on dev 3
    2024-12-04T07:19:46.483Z
    algo-1:18569:18641 [7] NCCL INFO comm 0x5600ab614030 rank 7 nRanks 16 nNodes 2 localRanks 8 localRank 7 MNNVL 0
    2024-12-04T07:19:46.483Z
    algo-1:18568:18640 [6] NCCL INFO comm 0x55fd3e248290 rank 6 nRanks 16 nNodes 2 localRanks 8 localRank 6 MNNVL 0
    2024-12-04T07:19:46.483Z
    algo-1:18567:18642 [5] NCCL INFO comm 0x560b96d74770 rank 5 nRanks 16 nNodes 2 localRanks 8 localRank 5 MNNVL 0
    2024-12-04T07:19:46.483Z
    algo-1:18566:18639 [4] NCCL INFO comm 0x55bbe3155fb0 rank 4 nRanks 16 nNodes 2 localRanks 8 localRank 4 MNNVL 0
    2024-12-04T07:19:46.483Z
    algo-1:18565:18637 [3] NCCL INFO comm 0x556e87b74910 rank 3 nRanks 16 nNodes 2 localRanks 8 localRank 3 MNNVL 0
    2024-12-04T07:19:46.483Z
    algo-1:18569:18641 [7] NCCL INFO Trees [0] -1/-1/-1->7->6 [1] -1/-1/-1->7->6
    2024-12-04T07:19:46.483Z
    algo-1:18568:18640 [6] NCCL INFO Trees [0] 7/-1/-1->6->5 [1] 7/-1/-1->6->5
    2024-12-04T07:19:46.483Z
    algo-1:18567:18642 [5] NCCL INFO Trees [0] 6/-1/-1->5->4 [1] 6/-1/-1->5->4
    2024-12-04T07:19:46.483Z
    algo-1:18569:18641 [7] NCCL INFO P2P Chunksize set to 131072
    2024-12-04T07:19:46.483Z
    algo-1:18568:18640 [6] NCCL INFO P2P Chunksize set to 131072
    2024-12-04T07:19:46.483Z
    algo-1:18566:18639 [4] NCCL INFO Trees [0] 5/-1/-1->4->3 [1] 5/-1/-1->4->3
    2024-12-04T07:19:46.483Z
    algo-1:18567:18642 [5] NCCL INFO P2P Chunksize set to 131072
    2024-12-04T07:19:46.483Z
    algo-1:18565:18637 [3] NCCL INFO Trees [0] 4/-1/-1->3->2 [1] 4/-1/-1->3->2
    2024-12-04T07:19:46.483Z
    algo-1:18566:18639 [4] NCCL INFO P2P Chunksize set to 131072
    2024-12-04T07:19:46.483Z
    algo-1:18562:18635 [0] NCCL INFO comm 0x56106ff88e80 rank 0 nRanks 16 nNodes 2 localRanks 8 localRank 0 MNNVL 0
    2024-12-04T07:19:46.483Z
    algo-1:18565:18637 [3] NCCL INFO P2P Chunksize set to 131072
    2024-12-04T07:19:46.483Z
    algo-1:18563:18636 [1] NCCL INFO comm 0x55c706a39990 rank 1 nRanks 16 nNodes 2 localRanks 8 localRank 1 MNNVL 0
    2024-12-04T07:19:46.483Z
    algo-1:18564:18638 [2] NCCL INFO comm 0x55bee3a05970 rank 2 nRanks 16 nNodes 2 localRanks 8 localRank 2 MNNVL 0
    2024-12-04T07:19:46.483Z
    algo-1:18563:18636 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0
    2024-12-04T07:19:46.483Z
    algo-1:18562:18635 [0] NCCL INFO Channel 00/02 : 0 7 6 5 4 3 2 1 8 15 14 13 12 11 10 9
    2024-12-04T07:19:46.483Z
    algo-1:18563:18636 [1] NCCL INFO P2P Chunksize set to 131072
    2024-12-04T07:19:46.483Z
    algo-1:18562:18635 [0] NCCL INFO Channel 01/02 : 0 7 6 5 4 3 2 1 8 15 14 13 12 11 10 9
    2024-12-04T07:19:46.483Z
    algo-1:18564:18638 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1
    2024-12-04T07:19:46.483Z
    algo-1:18564:18638 [2] NCCL INFO P2P Chunksize set to 131072
    2024-12-04T07:19:46.483Z
    algo-1:18562:18635 [0] NCCL INFO Trees [0] 1/8/-1->0->-1 [1] 1/-1/-1->0->8
    2024-12-04T07:19:46.483Z
    algo-1:18562:18635 [0] NCCL INFO P2P Chunksize set to 131072
    2024-12-04T07:19:46.483Z
    algo-1:18568:18640 [6] NCCL INFO Channel 00/0 : 6[6] -> 5[5] via P2P/CUMEM/read
    2024-12-04T07:19:46.483Z
    algo-1:18565:18637 [3] NCCL INFO Channel 00/0 : 3[3] -> 2[2] via P2P/CUMEM/read
    2024-12-04T07:19:46.483Z
    algo-1:18567:18642 [5] NCCL INFO Channel 00/0 : 5[5] -> 4[4] via P2P/CUMEM/read
    2024-12-04T07:19:46.483Z
    algo-1:18564:18638 [2] NCCL INFO Channel 00/0 : 2[2] -> 1[1] via P2P/CUMEM/read
    2024-12-04T07:19:46.483Z
    algo-1:18568:18640 [6] NCCL INFO Channel 01/0 : 6[6] -> 5[5] via P2P/CUMEM/read
    2024-12-04T07:19:46.483Z
    algo-1:18566:18639 [4] NCCL INFO Channel 00/0 : 4[4] -> 3[3] via P2P/CUMEM/read
    2024-12-04T07:19:46.483Z
    algo-1:18565:18637 [3] NCCL INFO Channel 01/0 : 3[3] -> 2[2] via P2P/CUMEM/read
    2024-12-04T07:19:46.483Z
    algo-1:18567:18642 [5] NCCL INFO Channel 01/0 : 5[5] -> 4[4] via P2P/CUMEM/read
    2024-12-04T07:19:46.483Z
    algo-1:18564:18638 [2] NCCL INFO Channel 01/0 : 2[2] -> 1[1] via P2P/CUMEM/read
    2024-12-04T07:19:46.483Z
    algo-1:18566:18639 [4] NCCL INFO Channel 01/0 : 4[4] -> 3[3] via P2P/CUMEM/read
    2024-12-04T07:19:46.483Z
    algo-1:18563:18649 [1] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T07:19:46.483Z
    algo-1:18563:18636 [1] NCCL INFO Channel 00/0 : 1[1] -> 8[0] [send] via NET/AWS Libfabric/0/GDRDMA
    2024-12-04T07:19:46.483Z
    algo-1:18563:18649 [1] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T07:19:46.483Z
    algo-1:18563:18636 [1] NCCL INFO Channel 01/0 : 1[1] -> 8[0] [send] via NET/AWS Libfabric/0/GDRDMA
    2024-12-04T07:19:46.484Z
    algo-1:18562:18648 [0] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T07:19:46.484Z
    libfabric:18562:1733296785::efa:ep_ctrl:efa_rdm_ep_set_shared_memory_permitted():1410 FI_OPT_SHARED_MEMORY_PERMITTED set to false
    2024-12-04T07:19:46.484Z
    libfabric:18563:1733296785::efa:ep_ctrl:efa_rdm_ep_set_shared_memory_permitted():1410 FI_OPT_SHARED_MEMORY_PERMITTED set to false
    2024-12-04T07:19:46.484Z
    algo-1:18563:18649 [1] create_send_comm:1853 NCCL WARN NET/OFI Received an invalid tag 0 for device 0
    2024-12-04T07:19:46.484Z
    algo-1:18563:18649 [1] NCCL INFO transport/net.cc:687 -> 3
    2024-12-04T07:19:46.484Z
    libfabric:18563:1733296785::efa:ep_ctrl:efa_rdm_ep_set_shared_memory_permitted():1410 FI_OPT_SHARED_MEMORY_PERMITTED set to false
    2024-12-04T07:19:46.484Z
    algo-1:18563:18649 [1] create_send_comm:1853 NCCL WARN NET/OFI Received an invalid tag 0 for device 0
    2024-12-04T07:19:46.484Z
    algo-1:18563:18649 [1] NCCL INFO transport/net.cc:687 -> 3
    2024-12-04T07:19:46.484Z
    algo-1:18562:18635 [0] NCCL INFO Channel 00/0 : 9[1] -> 0[0] [receive] via NET/AWS Libfabric/0/GDRDMA
    2024-12-04T07:19:46.484Z
    algo-1:18562:18648 [0] NCCL INFO NET/OFI Global registrations supported
    2024-12-04T07:19:46.484Z
    algo-1:18562:18635 [0] NCCL INFO Channel 01/0 : 9[1] -> 0[0] [receive] via NET/AWS Libfabric/0/GDRDMA
    2024-12-04T07:19:46.484Z
    algo-1:18562:18635 [0] NCCL INFO Channel 00/0 : 0[0] -> 7[7] via P2P/CUMEM/read
    2024-12-04T07:19:46.484Z
    algo-1:18562:18635 [0] NCCL INFO Channel 01/0 : 0[0] -> 7[7] via P2P/CUMEM/read
    2024-12-04T07:19:46.484Z
    algo-1:18566:18639 [4] NCCL INFO Connected all rings
    2024-12-04T07:19:46.484Z
    algo-1:18569:18641 [7] NCCL INFO Channel 00/0 : 7[7] -> 6[6] via P2P/CUMEM/read
    2024-12-04T07:19:46.484Z
    algo-1:18563:18636 [1] NCCL INFO transport/net.cc:306 -> 3
    2024-12-04T07:19:46.484Z
    algo-1:18563:18636 [1] NCCL INFO transport.cc:165 -> 3
    2024-12-04T07:19:46.484Z
    algo-1:18563:18636 [1] NCCL INFO init.cc:1263 -> 3
    2024-12-04T07:19:46.484Z
    algo-1:18563:18636 [1] NCCL INFO init.cc:1548 -> 3
    2024-12-04T07:19:46.484Z
    algo-1:18563:18636 [1] NCCL INFO group.cc:64 -> 3 [Async thread]
    2024-12-04T07:19:46.484Z
    algo-1:18566:18639 [4] NCCL INFO Channel 00/0 : 4[4] -> 5[5] via P2P/CUMEM/read
    2024-12-04T07:19:46.484Z
    algo-1:18563:18563 [1] NCCL INFO group.cc:418 -> 3
    2024-12-04T07:19:46.484Z
    algo-1:18563:18563 [1] NCCL INFO init.cc:1929 -> 3
    2024-12-04T07:19:46.484Z
    algo-1:18569:18641 [7] NCCL INFO Channel 01/0 : 7[7] -> 6[6] via P2P/CUMEM/read
    2024-12-04T07:19:46.484Z
    algo-1:18566:18639 [4] NCCL INFO Channel 01/0 : 4[4] -> 5[5] via P2P/CUMEM/read
    2024-12-04T07:19:46.484Z
    algo-1:18567:18642 [5] NCCL INFO Connected all rings
    2024-12-04T07:19:46.484Z
    algo-1:18567:18642 [5] NCCL INFO Channel 00/0 : 5[5] -> 6[6] via P2P/CUMEM/read
    2024-12-04T07:19:46.484Z
    algo-1:18567:18642 [5] NCCL INFO Channel 01/0 : 5[5] -> 6[6] via P2P/CUMEM/read
    2024-12-04T07:19:46.484Z
    [rank1]: Traceback (most recent call last):
    2024-12-04T07:19:46.484Z
    [rank1]: File "/opt/ml/code/dlrm_main.py", line 953, in
    2024-12-04T07:19:46.484Z
    [rank1]: invoke_main()
    2024-12-04T07:19:46.484Z
    [rank1]: File "/opt/ml/code/dlrm_main.py", line 950, in invoke_main
    2024-12-04T07:19:46.484Z
    [rank1]: main(sys.argv[1:])
    2024-12-04T07:19:46.484Z
    [rank1]: File "/opt/ml/code/dlrm_main.py", line 848, in main
    2024-12-04T07:19:46.484Z
    [rank1]: torch.distributed.barrier()
    2024-12-04T07:19:46.484Z
    [rank1]: File "/opt/conda/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
    2024-12-04T07:19:46.484Z
    [rank1]: return func(*args, **kwargs)
    2024-12-04T07:19:46.484Z
    [rank1]: ^^^^^^^^^^^^^^^^^^^^^
    2024-12-04T07:19:46.484Z
    [rank1]: File "/opt/conda/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 4159, in barrier
    2024-12-04T07:19:46.484Z
    [rank1]: work = group.barrier(opts=opts)
    2024-12-04T07:19:46.484Z
    [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^
    2024-12-04T07:19:46.484Z
    [rank1]: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/NCCLUtils.hpp:317, internal error - please report this issue to the NCCL developers, NCCL version 2.21.5

2024-12-04T07:19:46.484Z
[rank1]: ncclInternalError: Internal check failed.
2024-12-04T07:19:46.484Z
[rank1]: Last error:
2024-12-04T07:19:46.484Z
[rank1]: NET/OFI Received an invalid tag 0 for device 0
2024-12-04T07:19:47.485Z
W1204 07:19:47.027000 18465 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 18562 closing signal SIGTERM
2024-12-04T07:19:47.485Z
W1204 07:19:47.028000 18465 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 18564 closing signal SIGTERM
2024-12-04T07:19:47.485Z
W1204 07:19:47.028000 18465 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 18565 closing signal SIGTERM
2024-12-04T07:19:47.485Z
W1204 07:19:47.029000 18465 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 18566 closing signal SIGTERM
2024-12-04T07:19:47.485Z
W1204 07:19:47.029000 18465 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 18567 closing signal SIGTERM
2024-12-04T07:19:47.485Z
W1204 07:19:47.029000 18465 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 18568 closing signal SIGTERM
2024-12-04T07:19:47.485Z
W1204 07:19:47.029000 18465 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 18569 closing signal SIGTERM
2024-12-04T07:19:48.485Z
E1204 07:19:48.108000 18465 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 1 (pid: 18563) of binary: /opt/conda/bin/python3.11
2024-12-04T07:19:48.485Z
Traceback (most recent call last): File "/opt/conda/bin/torchrun", line 8, in
2024-12-04T07:19:48.485Z
sys.exit(main())
2024-12-04T07:19:48.485Z
^^^
2024-12-04T07:19:48.485Z
^^^ File "/opt/conda/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 355, in wrapper
2024-12-04T07:19:48.485Z
return f(*args, **kwargs) ^^^
2024-12-04T07:19:48.485Z
^^^^^^^^^^^^^^^ File "/opt/conda/lib/python3.11/site-packages/torch/distributed/run.py", line 919, in main
2024-12-04T07:19:48.485Z
run(args)
2024-12-04T07:19:48.485Z
File "/opt/conda/lib/python3.11/site-packages/torch/distributed/run.py", line 910, in run
2024-12-04T07:19:48.485Z
elastic_launch(
2024-12-04T07:19:48.485Z
File "/opt/conda/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 138, in call
2024-12-04T07:19:48.485Z
return launch_agent(self._config, self._entrypoint, list(args))
2024-12-04T07:19:48.485Z
^
2024-12-04T07:19:48.485Z
^^^^^^^^

2024-12-04T07:19:48.485Z
^^^^^^^^^^^^^^^^^^^^^
2024-12-04T07:19:48.486Z
^^^^^^^^^^^^^^^^^^^^^
2024-12-04T07:19:48.486Z
^^^^^ File "/opt/conda/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
2024-12-04T07:19:48.486Z
raise ChildFailedError(
2024-12-04T07:19:48.486Z
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
2024-12-04T07:19:48.486Z

2024-12-04T07:19:48.486Z
/opt/ml/code/dlrm_main.py FAILED
2024-12-04T07:19:48.486Z

2024-12-04T07:19:48.486Z
Failures: <NO_OTHER_FAILURES>
2024-12-04T07:19:48.486Z

2024-12-04T07:19:48.486Z
Root Cause (first observed failure):
2024-12-04T07:19:48.486Z
[0]: time : 2024-12-04_07:19:47 host : algo-1 rank : 1 (local_rank: 1) exitcode : 1 (pid: 18563) error_file: <N/A> traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
2024-12-04T07:19:48.486Z

2024-12-04T07:19:49.486Z
2024-12-04 07:19:48,520 sagemaker-training-toolkit INFO Waiting for the process to finish and give a return code.
2024-12-04T07:19:49.486Z
2024-12-04 07:19:48,520 sagemaker-training-toolkit INFO Done waiting for a return code. Received 0 from exiting process.
2024-12-04T07:19:49.486Z
2024-12-04 07:19:48,521 sagemaker-training-toolkit INFO Reporting training SUCCESS

It seems that they both have the same ulimit settings. My case is a bit special in that I have to download a lot of data at the beginning, before training starts. The number of files to open is also large, around 57600 on each of the two machines, and I also use pin_memory in the dataloader. I do not know whether this has something to do with the Cannot allocate memory error.
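For reference, the open-file limit can also be checked from inside the training process rather than from the shell. The following is only a minimal sketch: the helper names are hypothetical, it assumes Linux (it reads /proc/self/fd), and it assumes the worst case in which all 57600 files are open at the same time. It only covers the file-descriptor limit and says nothing about host memory.

```python
import os
import resource


def nofile_limits():
    """Return the (soft, hard) limits on open file descriptors for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def open_fd_count():
    """Count descriptors currently open in this process (Linux only)."""
    try:
        return len(os.listdir("/proc/self/fd"))
    except FileNotFoundError:
        return None  # /proc is not available on this platform


def has_fd_headroom(files_needed, soft_limit, already_open=0):
    """True if files_needed more descriptors fit under the soft limit."""
    if soft_limit == resource.RLIM_INFINITY:
        return True
    return already_open + files_needed <= soft_limit


if __name__ == "__main__":
    soft, hard = nofile_limits()
    used = open_fd_count() or 0
    # 57600 is the per-machine file count mentioned above; assumed all open at once.
    print(f"soft={soft} hard={hard} open={used} "
          f"headroom_ok={has_fd_headroom(57600, soft, used)}")
```

Running this on each rank before the dataloader is created would show whether the two machines really see the same limits inside the container.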

@rauteric
Contributor

Hi,

Since the ulimit settings look OK, some suggestions:

  1. Is it possible that loading the large amount of data is causing the instance to run out of host memory? If you can run top or free -h around the time the error occurs, that would show whether the instance is almost out of memory.
  2. In the past, Libfabric reserving huge pages has sometimes caused out-of-memory issues for the application. Setting FI_EFA_USE_HUGE_PAGE=0 may help. (This is listed in our 'EFA cheatsheet' of common env variable settings: https://github.com/aws/aws-ofi-nccl/blob/master/doc/efa-env-var.md.)
  3. If you share the instance ID and approximate timestamp of the error, we can look at hardware logs.
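Suggestions 1 and 2 can also be applied from the training script itself. The following is a rough sketch rather than a tested fix: the helper names are hypothetical, it assumes a Linux host where /proc/meminfo is readable, and it assumes FI_EFA_USE_HUGE_PAGE needs to be in the environment before Libfabric is initialized, so it would have to run before dist.init_process_group is called.

```python
import os


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: integer value}.

    Most fields are reported in kB; HugePages_* fields are page counts.
    """
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            info[key.strip()] = int(parts[0])
    return info


def available_fraction(info):
    """Fraction of host memory the kernel reports as available."""
    return info["MemAvailable"] / info["MemTotal"]


def disable_efa_huge_pages(env=os.environ):
    """Set FI_EFA_USE_HUGE_PAGE=0 unless a value was already chosen."""
    env.setdefault("FI_EFA_USE_HUGE_PAGE", "0")
    return env["FI_EFA_USE_HUGE_PAGE"]


if __name__ == "__main__":
    # Assumption: this runs before the process group is created.
    disable_efa_huge_pages()
    with open("/proc/meminfo") as f:
        info = parse_meminfo(f.read())
    print(f"MemAvailable: {available_fraction(info):.1%} of MemTotal")
```

Logging the available fraction right after the download finishes and again just before the barrier would show whether host memory is nearly exhausted at the point where the error appears.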
