Q: in-node performance of sm vs dc transports
I am testing HPC-X 2.10 (UCX 1.12, Open MPI 4.1.2rc4) on a 2-socket EPYC 7742 system with the osu_bibw benchmark. I am measuring in-node bandwidth: both ranks run on the same node, each pinned to a different socket, and I compare the sm and dc transports in this setup.
mpirun -x UCX_TLS=<sm|dc>,self -np 2 -cpu-set 0,64 ./osu_bibw
The IB HCA (mlx5_0) is attached to socket 0.
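For reference, the rank placement and the HCA's NUMA node can be double-checked along these lines (just a sketch using standard Open MPI and sysfs facilities, not part of the runs below):
mpirun --report-bindings -x UCX_TLS=sm,self -np 2 -cpu-set 0,64 ./osu_bibw
cat /sys/class/infiniband/mlx5_0/device/numa_node    # NUMA node the HCA is attached to (socket 0 here)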
In short, for large messages (128 KiB and up) the DC transport is much faster than shared-memory communication. That is not intuitive: in both cases - unless I'm mistaken - the data has to cross the inter-socket xGMI (Infinity Fabric) link. A simple STREAM benchmark shows me that this link delivers around 16 GB/s, which is consistent with the sm numbers. The dc transport, on the other hand, reaches 22 GB/s.
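For reference, the cross-socket figure can be reproduced with something along these lines (a sketch, assuming the default NPS1 setting, i.e. one NUMA node per socket, and a stock ./stream binary):
numactl --cpunodebind=0 --membind=1 ./stream    # cores on socket 0, memory forced onto socket 1, so every access crosses the socket link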
So where does the additional dc bandwidth come from? It looks as if the data is not actually delivered to the other socket, because the result is roughly the full 0.5x HDR link bandwidth. On the other hand, if I start both ranks on socket 1 (far from mlx5_0), the bandwidth drops to ~17 GB/s, which is similar to sm.
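(Back-of-the-envelope, assuming 0.5x HDR means a 100 Gb/s port: that is ~12.5 GB/s raw per direction, so up to ~25 GB/s summed over both directions, which is how osu_bibw reports bandwidth; the measured 22 GB/s is close to that limit.)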
Does anyone have an explanation?
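One check I have not run yet (sketch only, using the standard sysfs InfiniBand counters) would be to snapshot the HCA port data counters around a dc run, to see how much traffic actually passes through the port:
cat /sys/class/infiniband/mlx5_0/ports/1/counters/port_xmit_data    # snapshot before
mpirun -x UCX_TLS=dc,self -np 2 -cpu-set 0,64 ./osu_bibw
cat /sys/class/infiniband/mlx5_0/ports/1/counters/port_xmit_data    # snapshot after; the delta shows what the port really carried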
Here are the benchmark results for sm, for dc, and for dc with both ranks on socket 1:
mpirun -x UCX_TLS=sm,self -np 2 -cpu-set 0,64 ./osu_bibw
# OSU MPI Bi-Directional Bandwidth Test v5.7
# Size Bandwidth (MB/s)
1 11.01
2 22.00
4 44.22
8 88.16
16 175.61
32 348.94
64 676.70
128 1027.05
256 1991.48
512 3726.28
1024 6476.68
2048 9841.59
4096 13706.89
8192 16534.18
16384 14570.47
32768 19515.08
65536 20904.88
131072 16754.38
262144 16164.84
524288 16653.33
1048576 16007.03
2097152 15747.70
4194304 15597.56
mpirun -x UCX_TLS=dc,self -np 2 -cpu-set 0,64 ./osu_bibw
# OSU MPI Bi-Directional Bandwidth Test v5.7
# Size Bandwidth (MB/s)
1 3.96
2 7.52
4 14.60
8 29.30
16 58.74
32 116.60
64 240.98
128 468.97
256 898.24
512 1434.95
1024 2471.51
2048 3624.06
4096 4734.44
8192 8730.82
16384 8605.47
32768 8906.46
65536 10335.48
131072 20461.08
262144 21773.55
524288 22024.80
1048576 22089.42
2097152 21819.59
4194304 21459.80
# BOTH RANKS ON SOCKET 1
mpirun -x UCX_TLS=dc,self -np 2 -cpu-set 64,65 ./osu_bibw
# OSU MPI Bi-Directional Bandwidth Test v5.7
# Size Bandwidth (MB/s)
1 3.54
2 6.83
4 13.50
8 26.95
16 53.81
32 111.62
64 241.91
128 465.09
256 885.68
512 1454.20
1024 2893.26
2048 4805.69
4096 6040.51
8192 10539.88
16384 11110.72
32768 11448.17
65536 11692.00
131072 10789.32
262144 17522.39
524288 17751.63
1048576 17830.33
2097152 17646.57
4194304 16769.72