
MPI-RMA - performance issues with `MPI_Get`

Open · thomasgillis opened this issue 3 years ago · 1 comment

Background information

My application relies on many calls to MPI_Get (a few hundred per synchronization call, typically 200-600) with small messages (roughly 64 bytes to 9 kB). I observe a severe performance drop when going from one node to multiple nodes.
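For context, here is a minimal sketch of that access pattern (not the application's actual code): many small `MPI_Get` calls issued inside a single passive-target epoch. The window size, message size, and target pattern below are placeholders.

```c
#include <mpi.h>
#include <stdlib.h>

/* Sketch of the access pattern described above: a few hundred small
 * MPI_Get calls completed by one synchronization call. Sizes, counts,
 * and target offsets are placeholders, not the application's values. */
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const MPI_Aint win_bytes = 1 << 20;   /* 1 MiB exposed per rank */
    char *base;
    MPI_Win win;
    MPI_Win_allocate(win_bytes, 1, MPI_INFO_NULL, MPI_COMM_WORLD, &base, &win);

    const int n_gets = 400;               /* ~200-600 gets per sync call */
    const int msg_bytes = 4096;           /* small messages, 64 B - 9 kB */
    char *recv = malloc((size_t)n_gets * msg_bytes);

    MPI_Win_lock_all(MPI_MODE_NOCHECK, win);
    for (int i = 0; i < n_gets; ++i) {
        int target = (rank + 1 + i) % size;                     /* placeholder pattern */
        MPI_Aint disp = (MPI_Aint)(i * msg_bytes) % (win_bytes - msg_bytes);
        MPI_Get(recv + (size_t)i * msg_bytes, msg_bytes, MPI_BYTE,
                target, disp, msg_bytes, MPI_BYTE, win);
    }
    MPI_Win_flush_all(win);               /* complete all outstanding gets */
    MPI_Win_unlock_all(win);

    free(recv);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```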

This issue relates to @bosilca's comment here:

There seems to be a performance issue with the one-sided support in UCX. I used the OSU get_bw benchmark, with all types of synchronizations (fence, flush, lock/unlock) and while there is some variability the performance is consistently at a fraction of the point-to-point performance (about 1/100). Even switching the RMA support over TCP is about 20 times faster (mpirun -np 2 --map-by node --mca pml ob1 --mca osc rdma --mca btl_tcp_if_include ib0 ../pt2pt/osu_bw).

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

~~OpenMPI 4.1.4 + UCX 1.12.1, but the issue is similar on OpenMPI 4.1.2 with ugni~~

OpenMPI 4.1.2 with ugni

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

  • ~~OpenMPI 4.1.4 built with EasyBuild, with UCX 1.12.1 and OFI 1.14.0~~
  • OpenMPI 4.1.2 (installed by the Cori support team).

Please describe the system on which you are running

  • ~~OpenMPI 4.1.4 runs on InfiniBand HDR 200 Gbps, with large nodes (128 cores/node)~~
  • OpenMPI 4.1.2 runs on the Cray network

Details of the problem

application issue

EDIT: The issues on the IB cluster have since been solved thanks to the support team.

On the Cray cluster, using a weak-scaling approach (5.3M unknowns per rank), the time spent in MPI_Get goes from 0.7308 sec on 1 node to 17.6264 sec on 8 nodes (for the same part of the code).

~~Similar results are observed on the IB cluster (10M unknowns per rank), where on a single node the average measured bandwidth is 260-275 Mb/s while on 8 nodes it drops to 210-220 Mb/s (the theoretical bandwidth is 200 Gb/s). From a timing perspective, the MPI_Get calls show a more "normal" increase in time, from 1.0665 sec to 1.2820 sec.~~

Those numbers were obtained using MPI_Win_allocate and MPI_Type_create_hvector datatypes. In a previous version of the code that used MPI_Win_create, the one-node case was as slow as the eight-node one.
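For reference, a minimal sketch of that combination (window created with MPI_Win_allocate, strided target layout described by an hvector type). The block count, block length, and stride below are placeholder values, not the application's actual layout.

```c
#include <mpi.h>

/* Fetch a strided region from a remote rank's window (created elsewhere
 * with MPI_Win_allocate). Placeholder geometry: 16 blocks of 8 doubles,
 * separated by a 128-double stride on the target side. */
void get_strided_region(MPI_Win win, int target, double *dst) {
    const int      nblocks  = 16;
    const int      blocklen = 8;
    const MPI_Aint stride   = 128 * (MPI_Aint)sizeof(double);

    MPI_Datatype strided;
    MPI_Type_create_hvector(nblocks, blocklen, stride, MPI_DOUBLE, &strided);
    MPI_Type_commit(&strided);

    MPI_Win_lock(MPI_LOCK_SHARED, target, MPI_MODE_NOCHECK, win);
    /* origin side receives contiguously; target side uses the strided layout */
    MPI_Get(dst, nblocks * blocklen, MPI_DOUBLE,
            target, /* target_disp = */ 0, 1, strided, win);
    MPI_Win_unlock(target, win);

    MPI_Type_free(&strided);
}
```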

osu benchmarks - IB network

~~Following previous comments, I have also run the OSU benchmark osu_get_bw for several numbers of calls per synchronization and for different memory allocation strategies (see below). I compared the bandwidth measured between 2 ranks on the same node and on different nodes. Both cases barely reach 25 Gb/s while the network is supposed to deliver 200 Gb/s.~~

questions

  • on the Cray network: how can I reduce the performance loss?
  • ~~on the IB network: while the performance seems reasonable, I am confused by the measured bandwidth (both OSU and the real-life application). Is there any good reason for the measured bandwidth to be so low?~~

other related questions:

  • what is the expected influence of MPI_Alloc_mem on performance for IB networks? Are the gains specific to RMA, or does it benefit every MPI call? (see the sketch after this list)
  • what is the influence of export OMPI_MCA_pml_ucx_multi_send_nb=1? It is set to 0 by default in my configuration.
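To make the MPI_Alloc_mem question concrete, here is a hedged sketch of what is meant: backing an RMA window with memory from MPI_Alloc_mem instead of malloc. Whether this actually helps on a given IB or Cray network is exactly the open question; the code only shows the calls involved.

```c
#include <mpi.h>

/* Expose a buffer allocated with MPI_Alloc_mem through MPI_Win_create,
 * instead of passing a plain malloc'ed buffer. */
MPI_Win make_window(MPI_Aint bytes, MPI_Comm comm, void **base_out) {
    void *base;
    MPI_Alloc_mem(bytes, MPI_INFO_NULL, &base);   /* instead of malloc(bytes) */

    MPI_Win win;
    MPI_Win_create(base, bytes, /* disp_unit = */ 1, MPI_INFO_NULL, comm, &win);
    *base_out = base;
    return win;
}

void free_window(MPI_Win win, void *base) {
    MPI_Win_free(&win);
    MPI_Free_mem(base);                           /* pairs with MPI_Alloc_mem */
}
```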

~~At this stage it is not clear to me whether there is indeed a performance issue or whether this is the best the implementation can do. It may also be that the configuration is not appropriate for the way we use MPI-RMA.~~

I will be happy to try any suggestion you might have. Thanks for your help!

thomasgillis · Jul 16 '22 02:07

@open-mpi/ucx FYI

jsquyres · Jul 21 '22 17:07