
Failing to run cudf_merge benchmark on a node with 4 H100 #1088

@orliac

Description
Hi there,
I'm facing an issue when trying to run the cudf_merge benchmark locally on a node that hosts 4 H100 GPUs:

        GPU0    GPU1    GPU2    GPU3    NIC0    NIC1    NIC2    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NV6     NV6     NV6     SYS     SYS     SYS     24-31   3               N/A
GPU1    NV6      X      NV6     NV6     SYS     SYS     SYS     24-31   3               N/A
GPU2    NV6     NV6      X      NV6     SYS     SYS     SYS     40-47   5               N/A
GPU3    NV6     NV6     NV6      X      SYS     SYS     SYS     40-47   5               N/A
NIC0    SYS     SYS     SYS     SYS      X      PIX     SYS
NIC1    SYS     SYS     SYS     SYS     PIX      X      SYS
NIC2    SYS     SYS     SYS     SYS     SYS     SYS      X 

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_2
  NIC1: mlx5_3
  NIC2: mlx5_bond_0

I can run the benchmark over any pair of GPUs with no issue:

python -m ucp.benchmarks.cudf_merge --devs 0,1 --chunk-size 200_000_000 --iter 10

[1729696581.284731] [kh013:1596654:0]     ucp_context.c:2190 UCX  INFO  Version 1.17.0 (loaded from /ssoft/spack/pinot-noir/kuma-h100/v1/spack/opt/spack/linux-rhel9-zen4/gcc-13.2.0/ucx-1.17.0-no2vdboyxq2falry3mus5kwwmpafamdy/lib/libucp.so.0)
[1729696584.749090] [kh013:1596654:0]          parser.c:2314 UCX  INFO  UCX_* env variables: UCX_LOG_LEVEL=info UCX_MEMTYPE_CACHE=n UCX_RNDV_THRESH=8192 UCX_RNDV_FRAG_MEM_TYPE=cuda UCX_MAX_RNDV_RAILS=1 UCX_PROTO_ENABLE=n
[1729696586.301065] [kh013:1596677:0]     ucp_context.c:2190 UCX  INFO  Version 1.17.0 (loaded from /ssoft/spack/pinot-noir/kuma-h100/v1/spack/opt/spack/linux-rhel9-zen4/gcc-13.2.0/ucx-1.17.0-no2vdboyxq2falry3mus5kwwmpafamdy/lib/libucp.so.0)
[1729696586.311799] [kh013:1596678:0]     ucp_context.c:2190 UCX  INFO  Version 1.17.0 (loaded from /ssoft/spack/pinot-noir/kuma-h100/v1/spack/opt/spack/linux-rhel9-zen4/gcc-13.2.0/ucx-1.17.0-no2vdboyxq2falry3mus5kwwmpafamdy/lib/libucp.so.0)
[1729696586.928897] [kh013:1596677:0]          parser.c:2314 UCX  INFO  UCX_* env variables: UCX_LOG_LEVEL=info UCX_MEMTYPE_CACHE=n UCX_RNDV_THRESH=8192 UCX_RNDV_FRAG_MEM_TYPE=cuda UCX_MAX_RNDV_RAILS=1 UCX_PROTO_ENABLE=n
[1729696586.928901] [kh013:1596678:0]          parser.c:2314 UCX  INFO  UCX_* env variables: UCX_LOG_LEVEL=info UCX_MEMTYPE_CACHE=n UCX_RNDV_THRESH=8192 UCX_RNDV_FRAG_MEM_TYPE=cuda UCX_MAX_RNDV_RAILS=1 UCX_PROTO_ENABLE=n
[1729696586.949084] [kh013:1596677:0]      ucp_worker.c:1888 UCX  INFO  ucp_context_0 intra-node cfg#3 tag(rc_mlx5/mlx5_bond_0:1)  rma(rc_mlx5/mlx5_bond_0:1)  am(rc_mlx5/mlx5_bond_0:1)  stream(rc_mlx5/mlx5_bond_0:1)  ka(rc_mlx5/mlx5_bond_0:1)
[1729696586.949104] [kh013:1596678:0]      ucp_worker.c:1888 UCX  INFO  ucp_context_0 intra-node cfg#3 tag(rc_mlx5/mlx5_bond_0:1)  rma(rc_mlx5/mlx5_bond_0:1)  am(rc_mlx5/mlx5_bond_0:1)  stream(rc_mlx5/mlx5_bond_0:1)  ka(rc_mlx5/mlx5_bond_0:1)
[1729696586.971682] [kh013:1596654:0]      ucp_worker.c:1888 UCX  INFO    ucp_context_0 intra-node cfg#2 tag(rc_mlx5/mlx5_bond_0:1)  rma(rc_mlx5/mlx5_bond_0:1)  am(rc_mlx5/mlx5_bond_0:1)  stream(rc_mlx5/mlx5_bond_0:1)  ka(rc_mlx5/mlx5_bond_0:1)
[1729696586.988770] [kh013:1596677:0]      ucp_worker.c:1888 UCX  INFO      ucp_context_0 intra-node cfg#4 tag(rc_mlx5/mlx5_bond_0:1)  rma(rc_mlx5/mlx5_bond_0:1)  am(rc_mlx5/mlx5_bond_0:1)  stream(rc_mlx5/mlx5_bond_0:1)  ka(rc_mlx5/mlx5_bond_0:1)
[1729696586.992192] [kh013:1596678:0]      ucp_worker.c:1888 UCX  INFO      ucp_context_0 intra-node cfg#4 tag(rc_mlx5/mlx5_bond_0:1)  rma(rc_mlx5/mlx5_bond_0:1)  am(rc_mlx5/mlx5_bond_0:1)  stream(rc_mlx5/mlx5_bond_0:1)  ka(rc_mlx5/mlx5_bond_0:1)
[1729696586.994004] [kh013:1596678:0]      ucp_worker.c:1888 UCX  INFO    ucp_context_0 intra-node cfg#5 tag(rc_mlx5/mlx5_2:1 cuda_ipc/cuda)  rma(rc_mlx5/mlx5_2:1)  am(rc_mlx5/mlx5_2:1 cuda_ipc/cuda)  stream(rc_mlx5/mlx5_2:1)  ka(ud_mlx5/mlx5_2:1)
[1729696586.994007] [kh013:1596677:0]      ucp_worker.c:1888 UCX  INFO    ucp_context_0 intra-node cfg#5 tag(rc_mlx5/mlx5_2:1 cuda_ipc/cuda)  rma(rc_mlx5/mlx5_2:1)  am(rc_mlx5/mlx5_2:1 cuda_ipc/cuda)  stream(rc_mlx5/mlx5_2:1)  ka(ud_mlx5/mlx5_2:1)
[1729696587.000277] [kh013:1596654:0]      ucp_worker.c:1888 UCX  INFO    ucp_context_0 intra-node cfg#3 tag(rc_mlx5/mlx5_2:1 cuda_ipc/cuda)  rma(rc_mlx5/mlx5_2:1)  am(rc_mlx5/mlx5_2:1 cuda_ipc/cuda)  stream(rc_mlx5/mlx5_2:1)  ka(ud_mlx5/mlx5_2:1)
[1729696587.123809] [kh013:1596678:0]      ucp_worker.c:1888 UCX  INFO    ucp_context_0 intra-node cfg#6 tag(rc_mlx5/mlx5_bond_0:1)  rma(rc_mlx5/mlx5_bond_0:1)  am(rc_mlx5/mlx5_bond_0:1)  stream(rc_mlx5/mlx5_bond_0:1)  ka(rc_mlx5/mlx5_bond_0:1)
[1729696587.135094] [kh013:1596677:0]      ucp_worker.c:1888 UCX  INFO      ucp_context_0 intra-node cfg#6 tag(rc_mlx5/mlx5_bond_0:1)  rma(rc_mlx5/mlx5_bond_0:1)  am(rc_mlx5/mlx5_bond_0:1)  stream(rc_mlx5/mlx5_bond_0:1)  ka(rc_mlx5/mlx5_bond_0:1)
[1729696587.136850] [kh013:1596677:0]      ucp_worker.c:1888 UCX  INFO    ucp_context_0 intra-node cfg#7 tag(rc_mlx5/mlx5_2:1 cuda_ipc/cuda)  rma(rc_mlx5/mlx5_2:1)  am(rc_mlx5/mlx5_2:1 cuda_ipc/cuda)  stream(rc_mlx5/mlx5_2:1)  ka(ud_mlx5/mlx5_2:1)
[1729696587.137598] [kh013:1596678:0]      ucp_worker.c:1888 UCX  INFO    ucp_context_0 intra-node cfg#7 tag(rc_mlx5/mlx5_2:1 cuda_ipc/cuda)  rma(rc_mlx5/mlx5_2:1)  am(rc_mlx5/mlx5_2:1 cuda_ipc/cuda)  stream(rc_mlx5/mlx5_2:1)  ka(ud_mlx5/mlx5_2:1)
cuDF merge benchmark
--------------------------------------------------------------------------------------------------------------
Device(s)                 | [0, 1]
Chunks per device         | 1
Rows per chunk            | 200000000
Total data processed      | 119.21 GiB
Data processed per iter   | 11.92 GiB
Row matching fraction     | 0.3
==============================================================================================================
Wall-clock                | 3.00 s
Bandwidth                 | 24.88 GiB/s
Throughput                | 39.68 GiB/s
==============================================================================================================
Run                       | Wall-clock                | Bandwidth                 | Throughput
0                         | 161.36 ms                 | 108.39 GiB/s              | 73.88 GiB/s
1                         | 360.91 ms                 | 18.61 GiB/s               | 33.03 GiB/s
2                         | 455.87 ms                 | 13.33 GiB/s               | 26.15 GiB/s
3                         | 383.55 ms                 | 16.98 GiB/s               | 31.08 GiB/s
4                         | 169.31 ms                 | 90.74 GiB/s               | 70.41 GiB/s
5                         | 474.04 ms                 | 12.65 GiB/s               | 25.15 GiB/s
6                         | 293.38 ms                 | 25.85 GiB/s               | 40.63 GiB/s
7                         | 370.52 ms                 | 17.85 GiB/s               | 32.17 GiB/s
8                         | 161.21 ms                 | 108.41 GiB/s              | 73.95 GiB/s
9                         | 169.63 ms                 | 90.10 GiB/s               | 70.28 GiB/s

But it fails when run over all 4 devices:

python -m ucp.benchmarks.cudf_merge --devs 0,1,2,3 --chunk-size 200_000_000 --iter 10
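Since every pair works but all four devices fail, one thing I could try is bisecting at three devices to see at which device count it breaks. A small sketch enumerating the 3-device invocations (assuming `--devs` accepts any subset, as it does for pairs):

```python
from itertools import combinations

# Generate one cudf_merge invocation per 3-GPU subset, to narrow down
# whether the failure needs all four devices or just three.
gpus = [0, 1, 2, 3]
commands = [
    f"python -m ucp.benchmarks.cudf_merge --devs {a},{b},{c} "
    f"--chunk-size 200_000_000 --iter 10"
    for a, b, c in combinations(gpus, 3)
]
for cmd in commands:
    print(cmd)
```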

[1729696635.679592] [kh013:1596934:0]     ucp_context.c:2190 UCX  INFO  Version 1.17.0 (loaded from /ssoft/spack/pinot-noir/kuma-h100/v1/spack/opt/spack/linux-rhel9-zen4/gcc-13.2.0/ucx-1.17.0-no2vdboyxq2falry3mus5kwwmpafamdy/lib/libucp.so.0)
[1729696639.178149] [kh013:1596934:0]          parser.c:2314 UCX  INFO  UCX_* env variables: UCX_LOG_LEVEL=info UCX_MEMTYPE_CACHE=n UCX_RNDV_THRESH=8192 UCX_RNDV_FRAG_MEM_TYPE=cuda UCX_MAX_RNDV_RAILS=1 UCX_PROTO_ENABLE=n
[1729696642.222481] [kh013:1596952:0]     ucp_context.c:2190 UCX  INFO  Version 1.17.0 (loaded from /ssoft/spack/pinot-noir/kuma-h100/v1/spack/opt/spack/linux-rhel9-zen4/gcc-13.2.0/ucx-1.17.0-no2vdboyxq2falry3mus5kwwmpafamdy/lib/libucp.so.0)
[1729696642.243167] [kh013:1596955:0]     ucp_context.c:2190 UCX  INFO  Version 1.17.0 (loaded from /ssoft/spack/pinot-noir/kuma-h100/v1/spack/opt/spack/linux-rhel9-zen4/gcc-13.2.0/ucx-1.17.0-no2vdboyxq2falry3mus5kwwmpafamdy/lib/libucp.so.0)
[1729696642.245405] [kh013:1596953:0]     ucp_context.c:2190 UCX  INFO  Version 1.17.0 (loaded from /ssoft/spack/pinot-noir/kuma-h100/v1/spack/opt/spack/linux-rhel9-zen4/gcc-13.2.0/ucx-1.17.0-no2vdboyxq2falry3mus5kwwmpafamdy/lib/libucp.so.0)
[1729696642.247080] [kh013:1596954:0]     ucp_context.c:2190 UCX  INFO  Version 1.17.0 (loaded from /ssoft/spack/pinot-noir/kuma-h100/v1/spack/opt/spack/linux-rhel9-zen4/gcc-13.2.0/ucx-1.17.0-no2vdboyxq2falry3mus5kwwmpafamdy/lib/libucp.so.0)
[1729696642.977422] [kh013:1596952:0]          parser.c:2314 UCX  INFO  UCX_* env variables: UCX_LOG_LEVEL=info UCX_MEMTYPE_CACHE=n UCX_RNDV_THRESH=8192 UCX_RNDV_FRAG_MEM_TYPE=cuda UCX_MAX_RNDV_RAILS=1 UCX_PROTO_ENABLE=n
[1729696642.980930] [kh013:1596955:0]          parser.c:2314 UCX  INFO  UCX_* env variables: UCX_LOG_LEVEL=info UCX_MEMTYPE_CACHE=n UCX_RNDV_THRESH=8192 UCX_RNDV_FRAG_MEM_TYPE=cuda UCX_MAX_RNDV_RAILS=1 UCX_PROTO_ENABLE=n
[1729696642.997896] [kh013:1596952:0]      ucp_worker.c:1888 UCX  INFO  ucp_context_0 intra-node cfg#3 tag(rc_mlx5/mlx5_bond_0:1)  rma(rc_mlx5/mlx5_bond_0:1)  am(rc_mlx5/mlx5_bond_0:1)  stream(rc_mlx5/mlx5_bond_0:1)  ka(rc_mlx5/mlx5_bond_0:1)
[1729696643.000134] [kh013:1596955:0]      ucp_worker.c:1888 UCX  INFO  ucp_context_0 intra-node cfg#3 tag(rc_mlx5/mlx5_bond_0:1)  rma(rc_mlx5/mlx5_bond_0:1)  am(rc_mlx5/mlx5_bond_0:1)  stream(rc_mlx5/mlx5_bond_0:1)  ka(rc_mlx5/mlx5_bond_0:1)
[1729696643.012295] [kh013:1596954:0]          parser.c:2314 UCX  INFO  UCX_* env variables: UCX_LOG_LEVEL=info UCX_MEMTYPE_CACHE=n UCX_RNDV_THRESH=8192 UCX_RNDV_FRAG_MEM_TYPE=cuda UCX_MAX_RNDV_RAILS=1 UCX_PROTO_ENABLE=n
[1729696643.014948] [kh013:1596953:0]          parser.c:2314 UCX  INFO  UCX_* env variables: UCX_LOG_LEVEL=info UCX_MEMTYPE_CACHE=n UCX_RNDV_THRESH=8192 UCX_RNDV_FRAG_MEM_TYPE=cuda UCX_MAX_RNDV_RAILS=1 UCX_PROTO_ENABLE=n
[1729696643.020409] [kh013:1596934:0]      ucp_worker.c:1888 UCX  INFO    ucp_context_0 intra-node cfg#2 tag(rc_mlx5/mlx5_bond_0:1)  rma(rc_mlx5/mlx5_bond_0:1)  am(rc_mlx5/mlx5_bond_0:1)  stream(rc_mlx5/mlx5_bond_0:1)  ka(rc_mlx5/mlx5_bond_0:1)
[1729696643.035409] [kh013:1596953:0]      ucp_worker.c:1888 UCX  INFO  ucp_context_0 intra-node cfg#3 tag(rc_mlx5/mlx5_bond_0:1)  rma(rc_mlx5/mlx5_bond_0:1)  am(rc_mlx5/mlx5_bond_0:1)  stream(rc_mlx5/mlx5_bond_0:1)  ka(rc_mlx5/mlx5_bond_0:1)
[1729696643.035662] [kh013:1596954:0]      ucp_worker.c:1888 UCX  INFO  ucp_context_0 intra-node cfg#3 tag(rc_mlx5/mlx5_bond_0:1)  rma(rc_mlx5/mlx5_bond_0:1)  am(rc_mlx5/mlx5_bond_0:1)  stream(rc_mlx5/mlx5_bond_0:1)  ka(rc_mlx5/mlx5_bond_0:1)
[1729696643.039847] [kh013:1596952:0]      ucp_worker.c:1888 UCX  INFO      ucp_context_0 intra-node cfg#4 tag(rc_mlx5/mlx5_bond_0:1)  rma(rc_mlx5/mlx5_bond_0:1)  am(rc_mlx5/mlx5_bond_0:1)  stream(rc_mlx5/mlx5_bond_0:1)  ka(rc_mlx5/mlx5_bond_0:1)
[1729696643.041863] [kh013:1596955:0]      ucp_worker.c:1888 UCX  INFO      ucp_context_0 intra-node cfg#4 tag(rc_mlx5/mlx5_bond_0:1)  rma(rc_mlx5/mlx5_bond_0:1)  am(rc_mlx5/mlx5_bond_0:1)  stream(rc_mlx5/mlx5_bond_0:1)  ka(rc_mlx5/mlx5_bond_0:1)
[1729696643.043738] [kh013:1596952:0]      ucp_worker.c:1888 UCX  INFO    ucp_context_0 intra-node cfg#5 tag(rc_mlx5/mlx5_2:1 cuda_ipc/cuda)  rma(rc_mlx5/mlx5_2:1)  am(rc_mlx5/mlx5_2:1 cuda_ipc/cuda)  stream(rc_mlx5/mlx5_2:1)  ka(ud_mlx5/mlx5_2:1)
[1729696643.043739] [kh013:1596955:0]      ucp_worker.c:1888 UCX  INFO    ucp_context_0 intra-node cfg#5 tag(rc_mlx5/mlx5_2:1 cuda_ipc/cuda)  rma(rc_mlx5/mlx5_2:1)  am(rc_mlx5/mlx5_2:1 cuda_ipc/cuda)  stream(rc_mlx5/mlx5_2:1)  ka(ud_mlx5/mlx5_2:1)
[1729696643.046726] [kh013:1596954:0]      ucp_worker.c:1888 UCX  INFO      ucp_context_0 intra-node cfg#4 tag(rc_mlx5/mlx5_bond_0:1)  rma(rc_mlx5/mlx5_bond_0:1)  am(rc_mlx5/mlx5_bond_0:1)  stream(rc_mlx5/mlx5_bond_0:1)  ka(rc_mlx5/mlx5_bond_0:1)
[1729696643.048626] [kh013:1596953:0]      ucp_worker.c:1888 UCX  INFO      ucp_context_0 intra-node cfg#4 tag(rc_mlx5/mlx5_bond_0:1)  rma(rc_mlx5/mlx5_bond_0:1)  am(rc_mlx5/mlx5_bond_0:1)  stream(rc_mlx5/mlx5_bond_0:1)  ka(rc_mlx5/mlx5_bond_0:1)
[1729696643.049389] [kh013:1596954:0]      ucp_worker.c:1888 UCX  INFO    ucp_context_0 intra-node cfg#5 tag(rc_mlx5/mlx5_2:1 cuda_ipc/cuda)  rma(rc_mlx5/mlx5_2:1)  am(rc_mlx5/mlx5_2:1 cuda_ipc/cuda)  stream(rc_mlx5/mlx5_2:1)  ka(ud_mlx5/mlx5_2:1)
[1729696643.050116] [kh013:1596934:0]      ucp_worker.c:1888 UCX  INFO    ucp_context_0 intra-node cfg#3 tag(rc_mlx5/mlx5_2:1 cuda_ipc/cuda)  rma(rc_mlx5/mlx5_2:1)  am(rc_mlx5/mlx5_2:1 cuda_ipc/cuda)  stream(rc_mlx5/mlx5_2:1)  ka(ud_mlx5/mlx5_2:1)
[1729696643.050260] [kh013:1596953:0]      ucp_worker.c:1888 UCX  INFO    ucp_context_0 intra-node cfg#5 tag(rc_mlx5/mlx5_2:1 cuda_ipc/cuda)  rma(rc_mlx5/mlx5_2:1)  am(rc_mlx5/mlx5_2:1 cuda_ipc/cuda)  stream(rc_mlx5/mlx5_2:1)  ka(ud_mlx5/mlx5_2:1)
[1729696643.076486] [kh013:1596954:0]      ucp_worker.c:1888 UCX  INFO    ucp_context_0 intra-node cfg#6 tag(rc_mlx5/mlx5_bond_0:1)  rma(rc_mlx5/mlx5_bond_0:1)  am(rc_mlx5/mlx5_bond_0:1)  stream(rc_mlx5/mlx5_bond_0:1)  ka(rc_mlx5/mlx5_bond_0:1)
[1729696643.087261] [kh013:1596953:0]      ucp_worker.c:1888 UCX  INFO      ucp_context_0 intra-node cfg#6 tag(rc_mlx5/mlx5_bond_0:1)  rma(rc_mlx5/mlx5_bond_0:1)  am(rc_mlx5/mlx5_bond_0:1)  stream(rc_mlx5/mlx5_bond_0:1)  ka(rc_mlx5/mlx5_bond_0:1)
[1729696643.089062] [kh013:1596953:0]      ucp_worker.c:1888 UCX  INFO    ucp_context_0 intra-node cfg#7 tag(rc_mlx5/mlx5_2:1 cuda_ipc/cuda)  rma(rc_mlx5/mlx5_2:1)  am(rc_mlx5/mlx5_2:1 cuda_ipc/cuda)  stream(rc_mlx5/mlx5_2:1)  ka(ud_mlx5/mlx5_2:1)
[1729696643.089778] [kh013:1596954:0]      ucp_worker.c:1888 UCX  INFO    ucp_context_0 intra-node cfg#7 tag(rc_mlx5/mlx5_2:1 cuda_ipc/cuda)  rma(rc_mlx5/mlx5_2:1)  am(rc_mlx5/mlx5_2:1 cuda_ipc/cuda)  stream(rc_mlx5/mlx5_2:1)  ka(ud_mlx5/mlx5_2:1)
[1729696643.103213] [kh013:1596955:0]      ucp_worker.c:1888 UCX  INFO    ucp_context_0 intra-node cfg#6 tag(rc_mlx5/mlx5_bond_0:1)  rma(rc_mlx5/mlx5_bond_0:1)  am(rc_mlx5/mlx5_bond_0:1)  stream(rc_mlx5/mlx5_bond_0:1)  ka(rc_mlx5/mlx5_bond_0:1)
[1729696643.114753] [kh013:1596955:0]      ucp_worker.c:1888 UCX  INFO    ucp_context_0 intra-node cfg#7 tag(rc_mlx5/mlx5_2:1 cuda_ipc/cuda)  rma(rc_mlx5/mlx5_2:1)  am(rc_mlx5/mlx5_2:1 cuda_ipc/cuda)  stream(rc_mlx5/mlx5_2:1)  ka(ud_mlx5/mlx5_2:1)
[1729696643.176847] [kh013:1596953:0]      ucp_worker.c:1888 UCX  INFO    ucp_context_0 intra-node cfg#8 tag(rc_mlx5/mlx5_bond_0:1)  rma(rc_mlx5/mlx5_bond_0:1)  am(rc_mlx5/mlx5_bond_0:1)  stream(rc_mlx5/mlx5_bond_0:1)  ka(rc_mlx5/mlx5_bond_0:1)
[1729696643.184527] [kh013:1596952:0]      ucp_worker.c:1888 UCX  INFO      ucp_context_0 intra-node cfg#6 tag(rc_mlx5/mlx5_bond_0:1)  rma(rc_mlx5/mlx5_bond_0:1)  am(rc_mlx5/mlx5_bond_0:1)  stream(rc_mlx5/mlx5_bond_0:1)  ka(rc_mlx5/mlx5_bond_0:1)
[1729696643.185616] [kh013:1596954:0]      ucp_worker.c:1888 UCX  INFO      ucp_context_0 intra-node cfg#8 tag(rc_mlx5/mlx5_bond_0:1)  rma(rc_mlx5/mlx5_bond_0:1)  am(rc_mlx5/mlx5_bond_0:1)  stream(rc_mlx5/mlx5_bond_0:1)  ka(rc_mlx5/mlx5_bond_0:1)
[1729696643.186623] [kh013:1596952:0]      ucp_worker.c:1888 UCX  INFO    ucp_context_0 intra-node cfg#7 tag(rc_mlx5/mlx5_2:1 cuda_ipc/cuda)  rma(rc_mlx5/mlx5_2:1)  am(rc_mlx5/mlx5_2:1 cuda_ipc/cuda)  stream(rc_mlx5/mlx5_2:1)  ka(ud_mlx5/mlx5_2:1)
[1729696643.187346] [kh013:1596954:0]      ucp_worker.c:1888 UCX  INFO    ucp_context_0 intra-node cfg#9 tag(rc_mlx5/mlx5_2:1 cuda_ipc/cuda)  rma(rc_mlx5/mlx5_2:1)  am(rc_mlx5/mlx5_2:1 cuda_ipc/cuda)  stream(rc_mlx5/mlx5_2:1)  ka(ud_mlx5/mlx5_2:1)
[1729696643.187424] [kh013:1596953:0]      ucp_worker.c:1888 UCX  INFO    ucp_context_0 intra-node cfg#9 tag(rc_mlx5/mlx5_2:1 cuda_ipc/cuda)  rma(rc_mlx5/mlx5_2:1)  am(rc_mlx5/mlx5_2:1 cuda_ipc/cuda)  stream(rc_mlx5/mlx5_2:1)  ka(ud_mlx5/mlx5_2:1)
Task exception was never retrieved
future: <Task finished name='Task-7' coro=<_listener_handler_coroutine() done, defined at /work/scitas-ge/orliac/KUMA_VENVS/UCX-PY-BENCH/lib/python3.11/site-packages/ucp/core.py:140> exception=UCXCanceled("<[Recv #002] ep: 0x7f5af660f140, tag: 0xedf8353cc3df7250, nbytes: 8, type: <class 'array.array'>>: ")>
Traceback (most recent call last):
  File "/work/scitas-ge/orliac/KUMA_VENVS/UCX-PY-BENCH/lib/python3.11/site-packages/ucp/core.py", line 190, in _listener_handler_coroutine
    await func(ep)
  File "/work/scitas-ge/orliac/KUMA_VENVS/UCX-PY-BENCH/lib/python3.11/site-packages/ucp/benchmarks/utils.py", line 106, in server_handler
    worker_results = await recv_pickled_msg(ep)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/work/scitas-ge/orliac/KUMA_VENVS/UCX-PY-BENCH/lib/python3.11/site-packages/ucp/benchmarks/utils.py", line 75, in recv_pickled_msg
    msg = await ep.recv_obj()
          ^^^^^^^^^^^^^^^^^^^
  File "/work/scitas-ge/orliac/KUMA_VENVS/UCX-PY-BENCH/lib/python3.11/site-packages/ucp/core.py", line 863, in recv_obj
    await self.recv(nbytes, tag=tag)
  File "/work/scitas-ge/orliac/KUMA_VENVS/UCX-PY-BENCH/lib/python3.11/site-packages/ucp/core.py", line 737, in recv
    ret = await comm.tag_recv(self._ep, buffer, nbytes, tag, name=log)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ucp._libs.exceptions.UCXCanceled: <[Recv #002] ep: 0x7f5af660f140, tag: 0xedf8353cc3df7250, nbytes: 8, type: <class 'array.array'>>: 
Task exception was never retrieved
future: <Task finished name='Task-2' coro=<_listener_handler_coroutine() done, defined at /work/scitas-ge/orliac/KUMA_VENVS/UCX-PY-BENCH/lib/python3.11/site-packages/ucp/core.py:140> exception=UCXCanceled("<[Recv #002] ep: 0x7f5af660f080, tag: 0xd49e6a08b8eeaedd, nbytes: 8, type: <class 'array.array'>>: ")>
Traceback (most recent call last):
  File "/work/scitas-ge/orliac/KUMA_VENVS/UCX-PY-BENCH/lib/python3.11/site-packages/ucp/core.py", line 190, in _listener_handler_coroutine
    await func(ep)
  File "/work/scitas-ge/orliac/KUMA_VENVS/UCX-PY-BENCH/lib/python3.11/site-packages/ucp/benchmarks/utils.py", line 106, in server_handler
    worker_results = await recv_pickled_msg(ep)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/work/scitas-ge/orliac/KUMA_VENVS/UCX-PY-BENCH/lib/python3.11/site-packages/ucp/benchmarks/utils.py", line 75, in recv_pickled_msg
    msg = await ep.recv_obj()
          ^^^^^^^^^^^^^^^^^^^
  File "/work/scitas-ge/orliac/KUMA_VENVS/UCX-PY-BENCH/lib/python3.11/site-packages/ucp/core.py", line 863, in recv_obj
    await self.recv(nbytes, tag=tag)
  File "/work/scitas-ge/orliac/KUMA_VENVS/UCX-PY-BENCH/lib/python3.11/site-packages/ucp/core.py", line 737, in recv
    ret = await comm.tag_recv(self._ep, buffer, nbytes, tag, name=log)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ucp._libs.exceptions.UCXCanceled: <[Recv #002] ep: 0x7f5af660f080, tag: 0xd49e6a08b8eeaedd, nbytes: 8, type: <class 'array.array'>>: 
Task exception was never retrieved
future: <Task finished name='Task-3' coro=<_listener_handler_coroutine() done, defined at /work/scitas-ge/orliac/KUMA_VENVS/UCX-PY-BENCH/lib/python3.11/site-packages/ucp/core.py:140> exception=UCXCanceled("<[Recv #002] ep: 0x7f5af660f0c0, tag: 0xbe680e4915f49a08, nbytes: 8, type: <class 'array.array'>>: ")>
Traceback (most recent call last):
  File "/work/scitas-ge/orliac/KUMA_VENVS/UCX-PY-BENCH/lib/python3.11/site-packages/ucp/core.py", line 190, in _listener_handler_coroutine
    await func(ep)
  File "/work/scitas-ge/orliac/KUMA_VENVS/UCX-PY-BENCH/lib/python3.11/site-packages/ucp/benchmarks/utils.py", line 106, in server_handler
    worker_results = await recv_pickled_msg(ep)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/work/scitas-ge/orliac/KUMA_VENVS/UCX-PY-BENCH/lib/python3.11/site-packages/ucp/benchmarks/utils.py", line 75, in recv_pickled_msg
    msg = await ep.recv_obj()
          ^^^^^^^^^^^^^^^^^^^
  File "/work/scitas-ge/orliac/KUMA_VENVS/UCX-PY-BENCH/lib/python3.11/site-packages/ucp/core.py", line 863, in recv_obj
    await self.recv(nbytes, tag=tag)
  File "/work/scitas-ge/orliac/KUMA_VENVS/UCX-PY-BENCH/lib/python3.11/site-packages/ucp/core.py", line 737, in recv
    ret = await comm.tag_recv(self._ep, buffer, nbytes, tag, name=log)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ucp._libs.exceptions.UCXCanceled: <[Recv #002] ep: 0x7f5af660f0c0, tag: 0xbe680e4915f49a08, nbytes: 8, type: <class 'array.array'>>: 
Task exception was never retrieved
future: <Task finished name='Task-6' coro=<_listener_handler_coroutine() done, defined at /work/scitas-ge/orliac/KUMA_VENVS/UCX-PY-BENCH/lib/python3.11/site-packages/ucp/core.py:140> exception=UCXCanceled("<[Recv #002] ep: 0x7f5af660f100, tag: 0xf1c3abccc2be7d01, nbytes: 8, type: <class 'array.array'>>: ")>
Traceback (most recent call last):
  File "/work/scitas-ge/orliac/KUMA_VENVS/UCX-PY-BENCH/lib/python3.11/site-packages/ucp/core.py", line 190, in _listener_handler_coroutine
    await func(ep)
  File "/work/scitas-ge/orliac/KUMA_VENVS/UCX-PY-BENCH/lib/python3.11/site-packages/ucp/benchmarks/utils.py", line 106, in server_handler
    worker_results = await recv_pickled_msg(ep)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/work/scitas-ge/orliac/KUMA_VENVS/UCX-PY-BENCH/lib/python3.11/site-packages/ucp/benchmarks/utils.py", line 75, in recv_pickled_msg
    msg = await ep.recv_obj()
          ^^^^^^^^^^^^^^^^^^^
  File "/work/scitas-ge/orliac/KUMA_VENVS/UCX-PY-BENCH/lib/python3.11/site-packages/ucp/core.py", line 863, in recv_obj
    await self.recv(nbytes, tag=tag)
  File "/work/scitas-ge/orliac/KUMA_VENVS/UCX-PY-BENCH/lib/python3.11/site-packages/ucp/core.py", line 737, in recv
    ret = await comm.tag_recv(self._ep, buffer, nbytes, tag, name=log)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ucp._libs.exceptions.UCXCanceled: <[Recv #002] ep: 0x7f5af660f100, tag: 0xf1c3abccc2be7d01, nbytes: 8, type: <class 'array.array'>>: 
^CProcess Process-1:
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
Process Process-3:
  File "/work/scitas-ge/orliac/KUMA_VENVS/UCX-PY-BENCH/lib/python3.11/site-packages/ucp/benchmarks/cudf_merge.py", line 633, in <module>
Process Process-2:
    main()
  File "/work/scitas-ge/orliac/KUMA_VENVS/UCX-PY-BENCH/lib/python3.11/site-packages/ucp/benchmarks/cudf_merge.py", line 590, in main
    stats = [server_queue.get() for i in range(args.n_chunks)]
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/work/scitas-ge/orliac/KUMA_VENVS/UCX-PY-BENCH/lib/python3.11/site-packages/ucp/benchmarks/cudf_merge.py", line 590, in <listcomp>
    stats = [server_queue.get() for i in range(args.n_chunks)]
             ^^^^^^^^^^^^^^^^^^
  File "/ssoft/spack/pinot-noir/kuma-h100/v1/spack/opt/spack/linux-rhel9-zen4/gcc-13.2.0/python-3.11.7-wpgsyqek7spdydbmic66srcfb3v7kzoi/lib/python3.11/multiprocessing/queues.py", line 103, in get
    res = self._recv_bytes()
          ^^^^^^^^^^^^^^^^^^
  File "/ssoft/spack/pinot-noir/kuma-h100/v1/spack/opt/spack/linux-rhel9-zen4/gcc-13.2.0/python-3.11.7-wpgsyqek7spdydbmic66srcfb3v7kzoi/lib/python3.11/multiprocessing/connection.py", line 216, in recv_bytes
Traceback (most recent call last):
    buf = self._recv_bytes(maxlength)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/ssoft/spack/pinot-noir/kuma-h100/v1/spack/opt/spack/linux-rhel9-zen4/gcc-13.2.0/python-3.11.7-wpgsyqek7spdydbmic66srcfb3v7kzoi/lib/python3.11/multiprocessing/connection.py", line 430, in _recv_bytes
  File "/ssoft/spack/pinot-noir/kuma-h100/v1/spack/opt/spack/linux-rhel9-zen4/gcc-13.2.0/python-3.11.7-wpgsyqek7spdydbmic66srcfb3v7kzoi/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
    buf = self._recv(4)
  File "/ssoft/spack/pinot-noir/kuma-h100/v1/spack/opt/spack/linux-rhel9-zen4/gcc-13.2.0/python-3.11.7-wpgsyqek7spdydbmic66srcfb3v7kzoi/lib/python3.11/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/work/scitas-ge/orliac/KUMA_VENVS/UCX-PY-BENCH/lib/python3.11/site-packages/ucp/benchmarks/utils.py", line 125, in _server_process
    ret = loop.run_until_complete(run())
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
          ^^^^^^^^^^^^^
  File "/ssoft/spack/pinot-noir/kuma-h100/v1/spack/opt/spack/linux-rhel9-zen4/gcc-13.2.0/python-3.11.7-wpgsyqek7spdydbmic66srcfb3v7kzoi/lib/python3.11/asyncio/base_events.py", line 640, in run_until_complete
    self.run_forever()
  File "/ssoft/spack/pinot-noir/kuma-h100/v1/spack/opt/spack/linux-rhel9-zen4/gcc-13.2.0/python-3.11.7-wpgsyqek7spdydbmic66srcfb3v7kzoi/lib/python3.11/multiprocessing/connection.py", line 395, in _recv
  File "/ssoft/spack/pinot-noir/kuma-h100/v1/spack/opt/spack/linux-rhel9-zen4/gcc-13.2.0/python-3.11.7-wpgsyqek7spdydbmic66srcfb3v7kzoi/lib/python3.11/asyncio/base_events.py", line 607, in run_forever
    self._run_once()
  File "/ssoft/spack/pinot-noir/kuma-h100/v1/spack/opt/spack/linux-rhel9-zen4/gcc-13.2.0/python-3.11.7-wpgsyqek7spdydbmic66srcfb3v7kzoi/lib/python3.11/asyncio/base_events.py", line 1884, in _run_once
    event_list = self._selector.select(timeout)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/ssoft/spack/pinot-noir/kuma-h100/v1/spack/opt/spack/linux-rhel9-zen4/gcc-13.2.0/python-3.11.7-wpgsyqek7spdydbmic66srcfb3v7kzoi/lib/python3.11/selectors.py", line 468, in select
    fd_event_list = self._selector.poll(timeout, max_ev)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

My environment:

Package                Version
---------------------- -----------
cachetools             5.5.0
click                  8.1.7
cloudpickle            3.1.0
cuda-python            12.6.0
cudf-cu12              24.10.1
cupy-cuda12x           13.3.0
dask                   2024.9.0
dask-cudf-cu12         24.10.1
dask-expr              1.1.14
distributed            2024.9.0
fastrlock              0.8.2
fsspec                 2024.10.0
importlib_metadata     8.5.0
Jinja2                 3.1.4
libcudf-cu12           24.10.1
llvmlite               0.43.0
locket                 1.0.0
markdown-it-py         3.0.0
MarkupSafe             3.0.2
mdurl                  0.1.2
msgpack                1.1.0
numba                  0.60.0
numpy                  2.0.2
nvtx                   0.2.10
packaging              24.1
pandas                 2.2.2
partd                  1.4.2
pip                    23.2.1
psutil                 6.1.0
pyarrow                17.0.0
Pygments               2.18.0
pylibcudf-cu12         24.10.1
pynvjitlink-cu12       0.3.0
python-dateutil        2.9.0.post0
pytz                   2024.2
PyYAML                 6.0.2
rapids-dask-dependency 24.10.0
rich                   13.9.2
rmm-cu12               24.10.0
setuptools             65.5.0
six                    1.16.0
sortedcontainers       2.4.0
tblib                  3.0.0
toolz                  1.0.0
tornado                6.4.1
typing_extensions      4.12.2
tzdata                 2024.2
ucx-py-cu12            0.40.0
urllib3                2.2.3
zict                   3.0.0
zipp                   3.20.2

Any ideas?

Also, I'm surprised by the run-to-run variability of the benchmark across the 10 successive iterations.
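To quantify that variability, here is a quick summary of the per-run wall-clock times copied from the 2-GPU table above (`statistics` is stdlib):

```python
from statistics import mean, stdev

# Per-run wall-clock times (ms) from the 10-iteration 2-GPU run above.
wall_ms = [161.36, 360.91, 455.87, 383.55, 169.31,
           474.04, 293.38, 370.52, 161.21, 169.63]

print(f"mean   {mean(wall_ms):7.2f} ms")
print(f"stdev  {stdev(wall_ms):7.2f} ms")
print(f"range  {min(wall_ms):.2f}-{max(wall_ms):.2f} ms "
      f"(~{max(wall_ms) / min(wall_ms):.1f}x spread)")
```

The fastest runs are nearly 3x faster than the slowest ones, which seems like a lot for a single-node run.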

And finally, is it expected that the benchmark saturates the available bandwidth between the GPUs?
