Q: in-node performance of sm vs dc transports

I am testing HPCX 2.10 (UCX 1.12, OpenMPI 4.1.2rc4) on a 2-socket EPYC 7742 system with the osu_bibw benchmark. I am measuring in-node bandwidth: both ranks are started on the same node, each pinned to a different socket, and I compare the sm and dc transports in this setup:

mpirun -x UCX_TLS=<sm|dc>,self -np 2 -cpu-set 0,64 ./osu_bibw

The IB card is connected to socket 0.
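
For reference, the HCA locality and the transports UCX sees can be double-checked roughly like this (just a sketch, using the standard sysfs path and ucx_info):

# NUMA node the HCA is attached to
cat /sys/class/infiniband/mlx5_0/device/numa_node
# transports/devices UCX reports on this node
ucx_info -d | grep -i -e transport -e device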

In short, for large messages (128k+) it is much faster to use the dc transport than shared-memory communication. This is not intuitive: in both cases - unless I'm mistaken - the data has to cross the inter-socket GMI link. A simple STREAM benchmark shows that the bandwidth of that link is around 16 GB/s, which is consistent with the performance of the sm transport. On the other hand, dc gives 22 GB/s.
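
The cross-socket bandwidth can be approximated with a NUMA-pinned STREAM run, roughly along these lines (a sketch; it assumes NPS1, i.e. one NUMA node per socket, and a locally built stream binary):

# cores on socket 0, memory forced onto socket 1,
# so every load/store has to cross the inter-socket link
numactl --cpunodebind=0 --membind=1 ./stream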

So where does the additional dc performance come from? It looks as if the data is not actually delivered to the other socket, because the bandwidth is close to the full HDR line rate of one direction (200 Gbit/s, i.e. ~25 GB/s), which is 0.5x of the full-duplex bandwidth. On the other hand, if I start both ranks on socket 1 (far from mlx5_0), the bandwidth drops to ~17 GB/s, which is similar to sm.
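
One way to check whether the dc traffic actually goes out through the HCA port would be to compare the port counters around a run, e.g. (a sketch; port_xmit_data is reported in 4-byte units, and I am not sure it increments for HCA-internal loopback):

# transmit counter before and after the benchmark (4-byte units)
cat /sys/class/infiniband/mlx5_0/ports/1/counters/port_xmit_data
mpirun -x UCX_TLS=dc,self -np 2 -cpu-set 0,64 ./osu_bibw
cat /sys/class/infiniband/mlx5_0/ports/1/counters/port_xmit_data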

Does anyone have an explanation?

Here are the benchmark results for sm and dc, respectively, followed by a dc run with both ranks on socket 1:

mpirun -x UCX_TLS=sm,self -np 2 -cpu-set 0,64 ./osu_bibw 
# OSU MPI Bi-Directional Bandwidth Test v5.7
# Size      Bandwidth (MB/s)
1                      11.01
2                      22.00
4                      44.22
8                      88.16
16                    175.61
32                    348.94
64                    676.70
128                  1027.05
256                  1991.48
512                  3726.28
1024                 6476.68
2048                 9841.59
4096                13706.89
8192                16534.18
16384               14570.47
32768               19515.08
65536               20904.88
131072              16754.38
262144              16164.84
524288              16653.33
1048576             16007.03
2097152             15747.70
4194304             15597.56


mpirun -x UCX_TLS=dc,self -np 2 -cpu-set 0,64 ./osu_bibw 
# OSU MPI Bi-Directional Bandwidth Test v5.7
# Size      Bandwidth (MB/s)
1                       3.96
2                       7.52
4                      14.60
8                      29.30
16                     58.74
32                    116.60
64                    240.98
128                   468.97
256                   898.24
512                  1434.95
1024                 2471.51
2048                 3624.06
4096                 4734.44
8192                 8730.82
16384                8605.47
32768                8906.46
65536               10335.48
131072              20461.08
262144              21773.55
524288              22024.80
1048576             22089.42
2097152             21819.59
4194304             21459.80

# BOTH RANKS ON SOCKET 1
mpirun -x UCX_TLS=dc,self -np 2 -cpu-set 64,65 ./osu_bibw 
# OSU MPI Bi-Directional Bandwidth Test v5.7
# Size      Bandwidth (MB/s)
1                       3.54
2                       6.83
4                      13.50
8                      26.95
16                     53.81
32                    111.62
64                    241.91
128                   465.09
256                   885.68
512                  1454.20
1024                 2893.26
2048                 4805.69
4096                 6040.51
8192                10539.88
16384               11110.72
32768               11448.17
65536               11692.00
131072              10789.32
262144              17522.39
524288              17751.63
1048576             17830.33
2097152             17646.57
4194304             16769.72

angainor, May 12 '22 09:05