Performance issue, NVIDIA H100
I am seeing poor performance with intra-node, cross-GPU data exchanges on nodes with two H100 cards on PCIe.
First, I have seen #9287. Using the master branch instead of 1.15.0 fixes the major performance issue I saw with both 1.14 and 1.15 (~300 MB/s transfers). However, even with the master branch I still see problems: while #9287 concerned off-node transfers, I am transferring data over PCIe within a single node. This is osu_bibw D D with two ranks, each rank running on a different GPU, using current UCX master + Open MPI 4.1.6:
mpirun -x UCX_RNDV_THRESH=1024 -np 2 ~/gpurun.sh /cluster/software/OSU/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bibw -W 10 D D
# OSU MPI-CUDA Bi-Directional Bandwidth Test v7.2
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size Bandwidth (MB/s)
# Datatype: MPI_CHAR.
1 1.02
2 1.40
4 2.71
8 5.55
16 17.68
32 35.15
64 67.70
128 139.44
256 258.88
512 385.37
1024 339.21
2048 303.27
4096 597.08
8192 1184.56
16384 2312.20
32768 3804.01
65536 3788.16
131072 3875.80
262144 3758.54
524288 4516.19
1048576 4734.75
2097152 4745.47
4194304 4767.85
So the maximum bandwidth is ~4.7 GB/s. Using the older UCX 1.11.2 I see more than 7 GB/s (but with some errors):
[1698937808.560759] [gpu-9.fox:1895231:0] sys.c:138 UCX ERROR mremap(oldptr=0x7f14aef1a000 oldsize=4096, newsize=8192) failed: Cannot allocate memory
[...]
# OSU MPI-CUDA Bi-Directional Bandwidth Test v7.2
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size Bandwidth (MB/s)
# Datatype: MPI_CHAR.
1 1.03
2 1.31
4 2.65
8 5.31
16 16.35
32 32.34
64 62.32
128 130.75
256 241.94
512 412.85
1024 820.47
2048 1121.45
4096 2066.46
8192 3502.37
16384 5119.20
32768 6358.19
65536 6147.08
131072 6727.43
262144 6878.29
524288 7081.54
1048576 7294.67
2097152 7302.49
4194304 6913.40
Now, this system has PCIe 4.0 x16; at 16 GT/s x 16 lanes with 128b/130b encoding that is ~31.5 GB/s raw per direction, so I believe the bandwidth should be closer to 32 GB/s. On another system of ours, with multiple A100 cards also on PCIe 4.0 x16, I get the following performance with current master:
# OSU MPI-CUDA Bi-Directional Bandwidth Test v7.2
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size Bandwidth (MB/s)
# Datatype: MPI_CHAR.
1 0.89
2 1.32
4 2.69
8 5.41
16 17.59
32 34.58
64 66.36
128 134.28
256 206.07
512 301.51
1024 1681.90
2048 2744.90
4096 4134.45
8192 6070.76
16384 6147.47
32768 10635.98
65536 15865.66
131072 19908.34
262144 23061.34
524288 24824.82
1048576 25479.03
2097152 26046.26
4194304 27617.73
This is much better, and closer to what I think should be achievable.
On both systems the NVIDIA cards are connected via PCIe (no NVLink).
H100:
      GPU0  GPU1  NIC0  NIC1  NIC2  CPU Affinity   NUMA Affinity  GPU NUMA ID
GPU0  X     NODE  SYS   SYS   SYS   48-95,144-191  1              N/A
GPU1  NODE  X     SYS   SYS   SYS   48-95,144-191  1              N/A
A100:
      GPU0  GPU1  GPU2  GPU3  NIC0  NIC1  NIC2  CPU Affinity   NUMA Affinity  GPU NUMA ID
GPU0  X     NODE  SYS   SYS   NODE  SYS   SYS   0-47,96-143    0              N/A
GPU1  NODE  X     SYS   SYS   NODE  SYS   SYS   0-47,96-143    0              N/A
GPU2  SYS   SYS   X     NODE  SYS   NODE  NODE  48-95,144-191  1              N/A
GPU3  SYS   SYS   NODE  X     SYS   NODE  NODE  48-95,144-191  1              N/A
Do you have any ideas why the H100 performance is so low?
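For reference, one way to check which transports and protocols UCX actually selects for these transfers (a diagnostic sketch; UCX_PROTO_INFO requires the newer protocol-selection framework, which is enabled by default on current master):
$ ucx_info -d | grep -e Transport -e cuda   # which CUDA transports UCX sees
$ mpirun -x UCX_PROTO_INFO=y -x UCX_RNDV_THRESH=1024 -np 2 ~/gpurun.sh /cluster/software/OSU/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bibw -W 10 D D
If cuda_ipc is not among the selected transports, the GPU-to-GPU traffic is being staged through host memory via cuda_copy, which would be consistent with the low numbers above.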
Does nvidia-smi report any errors? Also, your H100s both seem to hang off the second CPU socket, so if your MPI processes run on the first physical CPU, everything has to be pushed over the inter-socket link (QPI/UPI on Intel, Infinity Fabric on AMD). But this is just a guess. I am unable to reproduce your bad numbers on our setup.
Everything below was built with cuda-12.2.2, hwloc 2.10.0, openmpi-4.1.6 and gcc-12.3.0. I ignored any affinity settings, so there may be extra traffic between the CPUs here as well, especially for host-to-device transfers or the other way around.
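A sketch of what explicit pinning could look like, in the spirit of the gpurun.sh wrapper above (hypothetical script; the rank-to-NUMA mapping is an assumption that must match nvidia-smi topo -m on the actual node):
#!/bin/bash
# Hypothetical pinning wrapper (call it pinrun.sh): give each local MPI
# rank its own GPU and bind it to the NUMA node that GPU hangs off, so
# PCIe traffic does not also have to cross the inter-socket link.
# Assumes GPU i sits on NUMA node i, as on the H100 box below; adjust
# the mapping if nvidia-smi topo -m says otherwise.
LOCAL_RANK=${OMPI_COMM_WORLD_LOCAL_RANK:-0}   # set by Open MPI per rank
export CUDA_VISIBLE_DEVICES=$LOCAL_RANK       # rank i uses GPU i
exec numactl --cpunodebind="$LOCAL_RANK" --membind="$LOCAL_RANK" "$@"
Used as, e.g., mpirun -x UCX_RNDV_THRESH=1024 -np 2 ./pinrun.sh .../osu_bibw -m :67108864 D D in place of the bare binary.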
Here are some quick-and-dirty tests:
2 x AMD EPYC 9354 32-Core Processor + 2 x NVIDIA H100, kernel 6.1.65-1.el9.elrepo.x86_64, AlmaLinux 9.3
$ nvidia-smi topo -m
      GPU0  GPU1  NIC0  NIC1  NIC2  NIC3  CPU Affinity  NUMA Affinity  GPU NUMA ID
GPU0  X     SYS   NODE  NODE  NODE  NODE  0-31,64-95    0              N/A
GPU1  SYS   X     SYS   SYS   SYS   SYS   32-63,96-127  1              N/A
NIC0  NODE  SYS   X     PIX   NODE  NODE
NIC1  NODE  SYS   PIX   X     NODE  NODE
NIC2  NODE  SYS   NODE  NODE  X     PIX
NIC3  NODE  SYS   NODE  NODE  PIX   X
$ sudo lspci -d 10de:2331 -v -v | grep Spee
LnkCap: Port #0, Speed 32GT/s, Width x16, ASPM L1, Exit Latency L1 <4us
LnkSta: Speed 32GT/s (ok), Width x16 (ok)
LnkCap2: Supported Link Speeds: 2.5-32GT/s, Crosslink- Retimer+ 2Retimers+ DRS-
LnkCtl2: Target Link Speed: 32GT/s, EnterCompliance- SpeedDis-
LnkCap: Port #0, Speed 32GT/s, Width x16, ASPM L1, Exit Latency L1 <4us
LnkSta: Speed 32GT/s (ok), Width x16 (ok)
LnkCap2: Supported Link Speeds: 2.5-32GT/s, Crosslink- Retimer+ 2Retimers+ DRS-
LnkCtl2: Target Link Speed: 32GT/s, EnterCompliance- SpeedDis-
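For reference, 32 GT/s is PCIe 5.0 signaling, so the raw link capacity is about:
32 GT/s x 16 lanes x 128/130 (encoding) ~ 504 Gb/s ~ 63 GB/s per direction
which means the ~35 GB/s bidirectional totals below are comfortably within what the links themselves can carry.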
ucx-main-branch:
$ mpirun -x UCX_RNDV_THRESH=1024 -np 2 .../openmpi-cuda-12.2.2-ucx-git-clone/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bibw -m :67108864 D D
# OSU MPI-CUDA Bi-Directional Bandwidth Test v7.3
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size Bandwidth (MB/s)
# Datatype: MPI_CHAR.
1 0.16
2 0.32
4 0.64
8 1.29
16 2.57
32 5.03
64 10.19
128 20.31
256 40.27
512 79.23
1024 606.23
2048 1214.16
4096 2438.31
8192 4661.27
16384 9145.66
32768 17175.89
65536 24781.16
131072 31401.63
262144 33786.10
524288 35493.97
1048576 35718.08
2097152 35941.49
4194304 35868.93
8388608 35656.63
16777216 35182.82
33554432 34752.11
67108864 34237.82
ucx-1.15.0:
$ mpirun -x UCX_RNDV_THRESH=1024 -np 2 .../openmpi-cuda-12.2.2-ucx-1.15.0/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bibw -m :67108864 D D
# OSU MPI-CUDA Bi-Directional Bandwidth Test v7.3
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size Bandwidth (MB/s)
# Datatype: MPI_CHAR.
1 0.16
2 0.31
4 0.63
8 1.25
16 2.50
32 4.90
64 9.80
128 19.46
256 38.80
512 76.64
1024 417.76
2048 834.74
4096 1674.66
8192 3271.89
16384 6497.69
32768 12530.12
65536 19835.57
131072 28557.03
262144 32903.66
524288 33987.58
1048576 35431.07
2097152 35743.33
4194304 35778.17
8388608 35630.94
16777216 35330.13
33554432 34697.19
67108864 34249.71
ucx-1.14.1:
$ mpirun -x UCX_RNDV_THRESH=1024 -np 2 .../openmpi-cuda-12.2.2-ucx-1.14.1/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bibw -m :67108864 D D
# OSU MPI-CUDA Bi-Directional Bandwidth Test v7.3
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size Bandwidth (MB/s)
# Datatype: MPI_CHAR.
1 0.16
2 0.31
4 0.63
8 1.25
16 2.51
32 4.90
64 9.72
128 19.56
256 38.75
512 76.85
1024 410.21
2048 820.23
4096 1642.47
8192 3209.85
16384 6363.25
32768 12322.44
65536 19256.42
131072 28332.57
262144 32255.84
524288 34561.41
1048576 35211.36
2097152 35686.29
4194304 35777.43
8388608 35626.51
16777216 34287.17
33554432 34719.13
67108864 34212.97
ucx-1.11.2:
$ mpirun -x UCX_RNDV_THRESH=1024 -np 2 .../openmpi-cuda-12.2.2-ucx-1.11.2/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bibw -m :67108864 D D
# OSU MPI-CUDA Bi-Directional Bandwidth Test v7.3
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size Bandwidth (MB/s)
# Datatype: MPI_CHAR.
1 0.16
2 0.31
4 0.62
8 1.24
16 2.48
32 4.87
64 9.49
128 19.07
256 38.07
512 75.36
1024 496.60
2048 996.20
4096 2051.20
8192 4101.27
16384 7300.18
32768 11175.04
65536 17885.63
131072 20314.19
262144 23944.71
524288 26245.95
1048576 28118.23
2097152 29170.23
4194304 29571.97
8388608 29825.39
16777216 29966.92
33554432 29971.00
67108864 30026.41
Same, but host-to-device:
$ mpirun -x UCX_RNDV_THRESH=1024 -np 2 .../openmpi-cuda-12.2.2-ucx-1.11.2/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bibw -m :67108864 H D
# OSU MPI-CUDA Bi-Directional Bandwidth Test v7.3
# Send Buffer on HOST (H) and Receive Buffer on DEVICE (D)
# Size Bandwidth (MB/s)
# Datatype: MPI_CHAR.
1 0.15
2 0.31
4 0.62
8 1.24
16 2.47
32 4.82
64 9.45
128 18.97
256 37.91
512 75.76
1024 254.39
2048 503.93
4096 989.90
8192 1822.85
16384 3405.85
32768 6052.50
65536 9507.08
131072 13055.87
262144 15623.59
524288 19492.59
1048576 20038.86
2097152 20068.27
4194304 20184.63
8388608 18522.25
16777216 16399.14
33554432 14517.47
67108864 10629.15
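To separate UCX/MPI behavior from the raw PCIe peer-to-peer capability of the box, NVIDIA's p2pBandwidthLatencyTest from the cuda-samples repository can serve as a baseline (assuming the samples are built; the path below is illustrative):
$ .../cuda-samples/bin/x86_64/linux/release/p2pBandwidthLatencyTest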
Now the 2 x NVIDIA A100 machine: Scientific Linux 7.9, kernel 5.4.264-1.el7.elrepo.x86_64, 2 x Intel(R) Xeon(R) Gold 6226R CPU @ 2.90GHz.
$ nvidia-smi topo -m
      GPU0  GPU1  NIC0  CPU Affinity  NUMA Affinity  GPU NUMA ID
GPU0  X     SYS   SYS   0-15          0              N/A
GPU1  SYS   X     NODE  16-31         1              N/A
NIC0  SYS   NODE  X
$ /sbin/lspci -d 10de:20f1 -v -v | grep Spee
LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM not supported, Exit Latency L0s unlimited, L1 <4us
LnkSta: Speed 8GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM not supported, Exit Latency L0s unlimited, L1 <4us
LnkSta: Speed 8GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
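Note the 8 GT/s here: that is PCIe 3.0 signaling, i.e. 8 GT/s x 16 lanes x 128/130 ~ 15.8 GB/s raw per direction, so this machine tops out far below the H100 box above regardless of the UCX version.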
ucx-1.15.0:
$ mpirun -x UCX_RNDV_THRESH=1024 -np 2 .../openmpi-cuda-12.2.2-ucx-1.15.0-sl73/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bibw -m :67108864 D D
# OSU MPI-CUDA Bi-Directional Bandwidth Test v7.3
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size Bandwidth (MB/s)
# Datatype: MPI_CHAR.
1 0.11
2 0.22
4 0.44
8 0.88
16 1.75
32 3.46
64 6.89
128 13.71
256 27.33
512 54.36
1024 431.73
2048 857.33
4096 1728.91
8192 3444.31
16384 6179.09
32768 10163.98
65536 13470.57
131072 16026.17
262144 17572.32
524288 18693.38
1048576 19060.35
2097152 19276.69
4194304 19417.53
8388608 19493.99
16777216 19531.11
33554432 19548.85
67108864 19558.72
And host-to-device:
$ mpirun -x UCX_RNDV_THRESH=1024 -np 2 .../openmpi-cuda-12.2.2-ucx-1.15.0-sl73/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bibw -m :67108864 H D
# OSU MPI-CUDA Bi-Directional Bandwidth Test v7.3
# Send Buffer on HOST (H) and Receive Buffer on DEVICE (D)
# Size Bandwidth (MB/s)
# Datatype: MPI_CHAR.
1 0.11
2 0.22
4 0.44
8 0.89
16 1.77
32 3.50
64 7.04
128 13.96
256 27.77
512 55.20
1024 143.72
2048 276.77
4096 532.09
8192 952.00
16384 1149.04
32768 1273.71
65536 1336.67
131072 1373.89
262144 1397.90
524288 1381.19
1048576 1328.04
2097152 1333.91
4194304 1334.54
8388608 1535.24
16777216 1479.22
33554432 1372.80
67108864 1227.54
And device-to-host:
$ mpirun -x UCX_RNDV_THRESH=1024 -np 2 /zhome/31/b/80425/local/openmpi-cuda-12.2.2-ucx-1.15.0-sl73/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bibw -m :67108864 D H
# OSU MPI-CUDA Bi-Directional Bandwidth Test v7.3
# Send Buffer on DEVICE (D) and Receive Buffer on HOST (H)
# Size Bandwidth (MB/s)
# Datatype: MPI_CHAR.
1 0.12
2 0.23
4 0.47
8 0.94
16 1.87
32 3.69
64 7.41
128 14.74
256 29.34
512 58.63
1024 155.63
2048 302.91
4096 564.90
8192 1026.56
16384 1252.63
32768 1378.11
65536 1455.18
131072 1485.79
262144 1491.67
524288 1456.78
1048576 1381.46
2097152 1385.11
4194304 1403.51
8388608 1601.55
16777216 1540.12
33554432 1430.35
67108864 1272.12
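As a baseline for the host-to-device and device-to-host numbers above, the raw per-GPU copy bandwidth can be sanity-checked with the bandwidthTest sample from cuda-samples (path illustrative; pinned memory is the relevant case for MPI staging):
$ .../cuda-samples/bin/x86_64/linux/release/bandwidthTest --memory=pinned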