Feature request: make different processes use different nearest NICs (when several NICs are equally near) when UCX_IB_PREFER_NEAREST_DEVICE is set

Open alexeedm opened this issue 3 years ago • 0 comments

Hey folks, I have a feature request related to UCX_IB_PREFER_NEAREST_DEVICE. Overall, I find this setting very useful, and it makes MPI code much more portable. However, there is one case where it doesn't produce the best binding: when two NICs are at the same distance from two processes on the node, both processes are assigned the same NIC and the other NIC may remain unused. NVIDIA DGX-A100-like nodes have exactly this architecture: a PCIe switch connects two GPUs and two NICs together. Here's an illustration: https://www.microway.com/hpc-tech-tips/dgx-a100-review-throughput-and-hardware-summary/

I think it would be very helpful to have a round-robin mapping policy for the case where several NICs are equally close, so that processes sharing the same set of nearest NICs are spread across them instead of all landing on the same one.
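To make the idea concrete, here is a minimal sketch of such a policy in Python. It is not UCX code and does not reflect how UCX selects devices internally. The function name `pick_nic`, the distance table, and the device names are all hypothetical, and it assumes each process already knows its node-local rank and a topology distance to each NIC.

```python
def pick_nic(distances, local_rank):
    """Pick a NIC for one process (hypothetical helper, not a UCX API).

    distances:  dict mapping device name -> topology distance (lower is nearer)
    local_rank: rank of the process among the processes on this node
    """
    if not distances:
        raise ValueError("no devices to choose from")

    # Find the smallest distance, then keep every device that ties for it.
    nearest = min(distances.values())
    candidates = sorted(name for name, d in distances.items() if d == nearest)

    # Round-robin over the tied devices instead of always taking the first.
    return candidates[local_rank % len(candidates)]


# Assumed example: two NICs behind the same PCIe switch, one farther away.
distances = {"mlx5_0": 1, "mlx5_1": 1, "mlx5_2": 3}
for rank in range(2):
    print(rank, pick_nic(distances, rank))
```

With two processes behind the same switch, local ranks 0 and 1 end up on different NICs. Taking the local rank modulo the number of tied devices is the simplest choice. It only balances well if the processes sharing a set of nearest NICs have consecutive local ranks, so a real implementation would more likely count processes per candidate set.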

alexeedm, Jul 26 '22 13:07