Performance issue, NVIDIA H100
I am seeing poor performance with intra-node, cross-GPU data exchanges on nodes with two H100 cards on PCIe.
First, I have seen #9287. Using the master branch instead of 1.15.0 fixes the major performance issue I saw with both 1.14 and 1.15 (~300 MB/s transfers). However, even with the master branch I still see problems: while #9287 concerned off-node transfers, I am transferring data over PCIe within a single node. This is osu_bibw D D with two ranks, each rank running on a different GPU, using current UCX master + Open MPI 4.1.6:
mpirun -x UCX_RNDV_THRESH=1024 -np 2 ~/gpurun.sh /cluster/software/OSU/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bibw -W 10 D D
# OSU MPI-CUDA Bi-Directional Bandwidth Test v7.2
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size Bandwidth (MB/s)
# Datatype: MPI_CHAR.
1 1.02
2 1.40
4 2.71
8 5.55
16 17.68
32 35.15
64 67.70
128 139.44
256 258.88
512 385.37
1024 339.21
2048 303.27
4096 597.08
8192 1184.56
16384 2312.20
32768 3804.01
65536 3788.16
131072 3875.80
262144 3758.54
524288 4516.19
1048576 4734.75
2097152 4745.47
4194304 4767.85
So the maximum bandwidth is ~4.7 GB/s. Using the older UCX 1.11.2 I see more than 7 GB/s (but with some errors):
[1698937808.560759] [gpu-9.fox:1895231:0] sys.c:138 UCX ERROR mremap(oldptr=0x7f14aef1a000 oldsize=4096, newsize=8192) failed: Cannot allocate memory
[...]
# OSU MPI-CUDA Bi-Directional Bandwidth Test v7.2
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size Bandwidth (MB/s)
# Datatype: MPI_CHAR.
1 1.03
2 1.31
4 2.65
8 5.31
16 16.35
32 32.34
64 62.32
128 130.75
256 241.94
512 412.85
1024 820.47
2048 1121.45
4096 2066.46
8192 3502.37
16384 5119.20
32768 6358.19
65536 6147.08
131072 6727.43
262144 6878.29
524288 7081.54
1048576 7294.67
2097152 7302.49
4194304 6913.40
Now, this system has PCIe 4.0 x16; at 16 GT/s x 16 lanes with 128b/130b encoding that is ~31.5 GB/s raw per direction, so I believe the bandwidth should be closer to 32 GB/s. On another system of ours, with multiple A100 cards also on PCIe 4.0 x16, I get the following performance with current master:
# OSU MPI-CUDA Bi-Directional Bandwidth Test v7.2
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size Bandwidth (MB/s)
# Datatype: MPI_CHAR.
1 0.89
2 1.32
4 2.69
8 5.41
16 17.59
32 34.58
64 66.36
128 134.28
256 206.07
512 301.51
1024 1681.90
2048 2744.90
4096 4134.45
8192 6070.76
16384 6147.47
32768 10635.98
65536 15865.66
131072 19908.34
262144 23061.34
524288 24824.82
1048576 25479.03
2097152 26046.26
4194304 27617.73
This is much better, and closer to what I think should be achievable.
On both systems the NVIDIA cards are connected via PCIe (no NVLink).
H100:
      GPU0  GPU1  NIC0  NIC1  NIC2  CPU Affinity   NUMA Affinity  GPU NUMA ID
GPU0  X     NODE  SYS   SYS   SYS   48-95,144-191  1              N/A
GPU1  NODE  X     SYS   SYS   SYS   48-95,144-191  1              N/A
A100:
      GPU0  GPU1  GPU2  GPU3  NIC0  NIC1  NIC2  CPU Affinity   NUMA Affinity  GPU NUMA ID
GPU0  X     NODE  SYS   SYS   NODE  SYS   SYS   0-47,96-143    0              N/A
GPU1  NODE  X     SYS   SYS   NODE  SYS   SYS   0-47,96-143    0              N/A
GPU2  SYS   SYS   X     NODE  SYS   NODE  NODE  48-95,144-191  1              N/A
GPU3  SYS   SYS   NODE  X     SYS   NODE  NODE  48-95,144-191  1              N/A
Do you have any ideas why the H100 performance is so low?
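For reference, one way to check which transports and protocols UCX actually selects for these transfers (a diagnostic sketch; UCX_PROTO_INFO requires the newer protocol-selection framework, which is enabled by default on current master):
$ ucx_info -d | grep -e Transport -e cuda   # which CUDA transports UCX sees
$ mpirun -x UCX_PROTO_INFO=y -x UCX_RNDV_THRESH=1024 -np 2 ~/gpurun.sh /cluster/software/OSU/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bibw -W 10 D D
If cuda_ipc is not among the selected transports, the GPU-to-GPU traffic is being staged through host memory via cuda_copy, which would be consistent with the low numbers above.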
Does nvidia-smi report any errors? Also, your H100s both seem to hang off the second CPU socket, so if your MPI processes run on the first physical CPU, everything has to be pushed over the inter-socket link (QPI/UPI on Intel, Infinity Fabric on AMD). But this is just a guess. I am unable to reproduce your bad numbers on our setup.
Everything below was built with cuda-12.2.2, hwloc 2.10.0, openmpi-4.1.6 and gcc-12.3.0. I ignored any affinity settings, so there may be extra traffic between the CPUs here as well, especially for host-to-device transfers or the other way around.
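A sketch of what explicit pinning could look like, in the spirit of the gpurun.sh wrapper above (hypothetical script; the rank-to-NUMA mapping is an assumption that must match nvidia-smi topo -m on the actual node):
#!/bin/bash
# Hypothetical pinning wrapper (call it pinrun.sh): give each local MPI
# rank its own GPU and bind it to the NUMA node that GPU hangs off, so
# PCIe traffic does not also have to cross the inter-socket link.
# Assumes GPU i sits on NUMA node i, as on the H100 box below; adjust
# the mapping if nvidia-smi topo -m says otherwise.
LOCAL_RANK=${OMPI_COMM_WORLD_LOCAL_RANK:-0}   # set by Open MPI per rank
export CUDA_VISIBLE_DEVICES=$LOCAL_RANK       # rank i uses GPU i
exec numactl --cpunodebind="$LOCAL_RANK" --membind="$LOCAL_RANK" "$@"
Used as, e.g., mpirun -x UCX_RNDV_THRESH=1024 -np 2 ./pinrun.sh .../osu_bibw -m :67108864 D D in place of the bare binary.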
Here are some quick-and-dirty tests:
2 x AMD EPYC 9354 32-Core Processor + 2 x NVIDIA H100, kernel 6.1.65-1.el9.elrepo.x86_64, AlmaLinux 9.3
$ nvidia-smi topo -m
      GPU0  GPU1  NIC0  NIC1  NIC2  NIC3  CPU Affinity  NUMA Affinity  GPU NUMA ID
GPU0  X     SYS   NODE  NODE  NODE  NODE  0-31,64-95    0              N/A
GPU1  SYS   X     SYS   SYS   SYS   SYS   32-63,96-127  1              N/A
NIC0  NODE  SYS   X     PIX   NODE  NODE
NIC1  NODE  SYS   PIX   X     NODE  NODE
NIC2  NODE  SYS   NODE  NODE  X     PIX
NIC3  NODE  SYS   NODE  NODE  PIX   X
$ sudo lspci -d 10de:2331 -v -v | grep Spee
LnkCap: Port #0, Speed 32GT/s, Width x16, ASPM L1, Exit Latency L1 <4us
LnkSta: Speed 32GT/s (ok), Width x16 (ok)
LnkCap2: Supported Link Speeds: 2.5-32GT/s, Crosslink- Retimer+ 2Retimers+ DRS-
LnkCtl2: Target Link Speed: 32GT/s, EnterCompliance- SpeedDis-
LnkCap: Port #0, Speed 32GT/s, Width x16, ASPM L1, Exit Latency L1 <4us
LnkSta: Speed 32GT/s (ok), Width x16 (ok)
LnkCap2: Supported Link Speeds: 2.5-32GT/s, Crosslink- Retimer+ 2Retimers+ DRS-
LnkCtl2: Target Link Speed: 32GT/s, EnterCompliance- SpeedDis-
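For reference, 32 GT/s is PCIe 5.0 signaling, so the raw link capacity is about:
32 GT/s x 16 lanes x 128/130 (encoding) ~ 504 Gb/s ~ 63 GB/s per direction
which means the ~35 GB/s bidirectional totals below are comfortably within what the links themselves can carry.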
ucx-main-branch:
$ mpirun -x UCX_RNDV_THRESH=1024 -np 2 .../openmpi-cuda-12.2.2-ucx-git-clone/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bibw -m :67108864 D D
# OSU MPI-CUDA Bi-Directional Bandwidth Test v7.3
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size Bandwidth (MB/s)
# Datatype: MPI_CHAR.
1 0.16
2 0.32
4 0.64
8 1.29
16 2.57
32 5.03
64 10.19
128 20.31
256 40.27
512 79.23
1024 606.23
2048 1214.16
4096 2438.31
8192 4661.27
16384 9145.66
32768 17175.89
65536 24781.16
131072 31401.63
262144 33786.10
524288 35493.97
1048576 35718.08
2097152 35941.49
4194304 35868.93
8388608 35656.63
16777216 35182.82
33554432 34752.11
67108864 34237.82
ucx-1.15.0:
$ mpirun -x UCX_RNDV_THRESH=1024 -np 2 .../openmpi-cuda-12.2.2-ucx-1.15.0/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bibw -m :67108864 D D
# OSU MPI-CUDA Bi-Directional Bandwidth Test v7.3
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size Bandwidth (MB/s)
# Datatype: MPI_CHAR.
1 0.16
2 0.31
4 0.63
8 1.25
16 2.50
32 4.90
64 9.80
128 19.46
256 38.80
512 76.64
1024 417.76
2048 834.74
4096 1674.66
8192 3271.89
16384 6497.69
32768 12530.12
65536 19835.57
131072 28557.03
262144 32903.66
524288 33987.58
1048576 35431.07
2097152 35743.33
4194304 35778.17
8388608 35630.94
16777216 35330.13
33554432 34697.19
67108864 34249.71
ucx-1.14.1:
$ mpirun -x UCX_RNDV_THRESH=1024 -np 2 .../openmpi-cuda-12.2.2-ucx-1.14.1/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bibw -m :67108864 D D
# OSU MPI-CUDA Bi-Directional Bandwidth Test v7.3
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size Bandwidth (MB/s)
# Datatype: MPI_CHAR.
1 0.16
2 0.31
4 0.63
8 1.25
16 2.51
32 4.90
64 9.72
128 19.56
256 38.75
512 76.85
1024 410.21
2048 820.23
4096 1642.47
8192 3209.85
16384 6363.25
32768 12322.44
65536 19256.42
131072 28332.57
262144 32255.84
524288 34561.41
1048576 35211.36
2097152 35686.29
4194304 35777.43
8388608 35626.51
16777216 34287.17
33554432 34719.13
67108864 34212.97
ucx-1.11.2:
$ mpirun -x UCX_RNDV_THRESH=1024 -np 2 .../openmpi-cuda-12.2.2-ucx-1.11.2/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bibw -m :67108864 D D
# OSU MPI-CUDA Bi-Directional Bandwidth Test v7.3
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size Bandwidth (MB/s)
# Datatype: MPI_CHAR.
1 0.16
2 0.31
4 0.62
8 1.24
16 2.48
32 4.87
64 9.49
128 19.07
256 38.07
512 75.36
1024 496.60
2048 996.20
4096 2051.20
8192 4101.27
16384 7300.18
32768 11175.04
65536 17885.63
131072 20314.19
262144 23944.71
524288 26245.95
1048576 28118.23
2097152 29170.23
4194304 29571.97
8388608 29825.39
16777216 29966.92
33554432 29971.00
67108864 30026.41
Same, but host-to-device:
$ mpirun -x UCX_RNDV_THRESH=1024 -np 2 .../openmpi-cuda-12.2.2-ucx-1.11.2/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bibw -m :67108864 H D
# OSU MPI-CUDA Bi-Directional Bandwidth Test v7.3
# Send Buffer on HOST (H) and Receive Buffer on DEVICE (D)
# Size Bandwidth (MB/s)
# Datatype: MPI_CHAR.
1 0.15
2 0.31
4 0.62
8 1.24
16 2.47
32 4.82
64 9.45
128 18.97
256 37.91
512 75.76
1024 254.39
2048 503.93
4096 989.90
8192 1822.85
16384 3405.85
32768 6052.50
65536 9507.08
131072 13055.87
262144 15623.59
524288 19492.59
1048576 20038.86
2097152 20068.27
4194304 20184.63
8388608 18522.25
16777216 16399.14
33554432 14517.47
67108864 10629.15
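To separate UCX/MPI behavior from the raw PCIe peer-to-peer capability of the box, NVIDIA's p2pBandwidthLatencyTest from the cuda-samples repository can serve as a baseline (assuming the samples are built; the path below is illustrative):
$ .../cuda-samples/bin/x86_64/linux/release/p2pBandwidthLatencyTest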
Now the 2 x NVIDIA A100 machine: Scientific Linux 7.9, kernel 5.4.264-1.el7.elrepo.x86_64, 2 x Intel(R) Xeon(R) Gold 6226R CPU @ 2.90GHz.
$ nvidia-smi topo -m
      GPU0  GPU1  NIC0  CPU Affinity  NUMA Affinity  GPU NUMA ID
GPU0  X     SYS   SYS   0-15          0              N/A
GPU1  SYS   X     NODE  16-31         1              N/A
NIC0  SYS   NODE  X
$ /sbin/lspci -d 10de:20f1 -v -v | grep Spee
LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM not supported, Exit Latency L0s unlimited, L1 <4us
LnkSta: Speed 8GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM not supported, Exit Latency L0s unlimited, L1 <4us
LnkSta: Speed 8GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
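Note the 8 GT/s here: that is PCIe 3.0 signaling, i.e. 8 GT/s x 16 lanes x 128/130 ~ 15.8 GB/s raw per direction, so this machine tops out far below the H100 box above regardless of the UCX version.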
ucx-1.15.0:
$ mpirun -x UCX_RNDV_THRESH=1024 -np 2 .../openmpi-cuda-12.2.2-ucx-1.15.0-sl73/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bibw -m :67108864 D D
# OSU MPI-CUDA Bi-Directional Bandwidth Test v7.3
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size Bandwidth (MB/s)
# Datatype: MPI_CHAR.
1 0.11
2 0.22
4 0.44
8 0.88
16 1.75
32 3.46
64 6.89
128 13.71
256 27.33
512 54.36
1024 431.73
2048 857.33
4096 1728.91
8192 3444.31
16384 6179.09
32768 10163.98
65536 13470.57
131072 16026.17
262144 17572.32
524288 18693.38
1048576 19060.35
2097152 19276.69
4194304 19417.53
8388608 19493.99
16777216 19531.11
33554432 19548.85
67108864 19558.72
And host-to-device:
$ mpirun -x UCX_RNDV_THRESH=1024 -np 2 .../openmpi-cuda-12.2.2-ucx-1.15.0-sl73/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bibw -m :67108864 H D
# OSU MPI-CUDA Bi-Directional Bandwidth Test v7.3
# Send Buffer on HOST (H) and Receive Buffer on DEVICE (D)
# Size Bandwidth (MB/s)
# Datatype: MPI_CHAR.
1 0.11
2 0.22
4 0.44
8 0.89
16 1.77
32 3.50
64 7.04
128 13.96
256 27.77
512 55.20
1024 143.72
2048 276.77
4096 532.09
8192 952.00
16384 1149.04
32768 1273.71
65536 1336.67
131072 1373.89
262144 1397.90
524288 1381.19
1048576 1328.04
2097152 1333.91
4194304 1334.54
8388608 1535.24
16777216 1479.22
33554432 1372.80
67108864 1227.54
And device-to-host:
$ mpirun -x UCX_RNDV_THRESH=1024 -np 2 /zhome/31/b/80425/local/openmpi-cuda-12.2.2-ucx-1.15.0-sl73/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bibw -m :67108864 D H
# OSU MPI-CUDA Bi-Directional Bandwidth Test v7.3
# Send Buffer on DEVICE (D) and Receive Buffer on HOST (H)
# Size Bandwidth (MB/s)
# Datatype: MPI_CHAR.
1 0.12
2 0.23
4 0.47
8 0.94
16 1.87
32 3.69
64 7.41
128 14.74
256 29.34
512 58.63
1024 155.63
2048 302.91
4096 564.90
8192 1026.56
16384 1252.63
32768 1378.11
65536 1455.18
131072 1485.79
262144 1491.67
524288 1456.78
1048576 1381.46
2097152 1385.11
4194304 1403.51
8388608 1601.55
16777216 1540.12
33554432 1430.35
67108864 1272.12
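As a baseline for the host-to-device and device-to-host numbers above, the raw per-GPU copy bandwidth can be sanity-checked with the bandwidthTest sample from cuda-samples (path illustrative; pinned memory is the relevant case for MPI staging):
$ .../cuda-samples/bin/x86_64/linux/release/bandwidthTest --memory=pinned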