rccl [Issue]: Algorithm and protocol selection

Problem Description

I want to check which algorithm and protocol are being selected for all_reduce_perf. I uncommented line 1520 in enqueue.cc, but no print statements appeared. How do we check the algorithm and protocol selection?

Operating System

SLES15.4

CPU

AMD EPYC 7A53

GPU

AMD Instinct MI250

ROCm Version

ROCm 6.0.0

Apr 10 '24 12:04 tks2004

I also tried modifying enqueue.cc at line 1273, but still did not get the algorithm or communication protocol.

Apr 10 '24 16:04 tks2004

How many logical GPUs are on the single node, 8 or 16? If there are 16 GPUs, it is possible that the MSCCL path is taken.

Apr 10 '24 21:04 wenkaidu

It is 4 GPUs, and the run used all 8 GCDs. Message sizes are from 2MB to 2GB. Relevant log lines:

NCCL INFO Connected all trees
NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 256 | 256
NCCL INFO 8 coll channels, 0 nvls channels, 8 p2p channels, 1 p2p channels per peer
NCCL INFO MSCCL: No external scheduler found, using internal implementation
NCCL INFO Using MSCCL files from /opt/rocm-6.0.0/lib/../share/rccl/msccl-algorithms
NCCL INFO MSCCL: Initialization finished, localSize 400

Apr 11 '24 10:04 tks2004

There are 2 possibilities:

NCCL_DEBUG_SUBSYS=INIT,TUNING is missing from the environment;
The default RCCL build from the ROCm release is being used instead of your modified build. Please commit your change and build, then confirm that the commit hash matches the "RCCL version" line in the log.
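
The two checks above could be scripted roughly as follows. This is only a sketch under stated assumptions: the tuning message format ("<bytes> Bytes -> Algo <n> proto <n> ...") is assumed from the commented-out line in enqueue.cc and is not confirmed in this thread, and the index-to-name tables are assumed to follow NCCL 2.18-era headers. The helper names (debug_env, parse_tuning, version_lines) are hypothetical.

```python
import os
import re

# ASSUMPTION: format of the tuning message printed from enqueue.cc.
# Not confirmed in this thread; adjust the pattern to your build's output.
TUNING_RE = re.compile(r"(\d+) Bytes -> Algo (\d+) proto (\d+)")

# ASSUMPTION: index order as in NCCL 2.18-era headers; verify against your RCCL source.
ALGOS = ["Tree", "Ring", "CollNetDirect", "CollNetChain", "NVLS", "NVLSTree"]
PROTOS = ["LL", "LL128", "Simple"]


def debug_env(base=None):
    """Return a copy of the environment with INFO-level INIT and TUNING logging enabled."""
    env = dict(os.environ if base is None else base)
    env["NCCL_DEBUG"] = "INFO"
    env["NCCL_DEBUG_SUBSYS"] = "INIT,TUNING"
    return env


def _name(table, index, prefix):
    """Map an index to a name, falling back to e.g. 'algo7' for unknown indices."""
    return table[index] if index < len(table) else f"{prefix}{index}"


def parse_tuning(log_text):
    """Extract (bytes, algorithm, protocol) tuples from tuning lines in a log."""
    results = []
    for match in TUNING_RE.finditer(log_text):
        nbytes, algo, proto = (int(g) for g in match.groups())
        results.append((nbytes, _name(ALGOS, algo, "algo"), _name(PROTOS, proto, "proto")))
    return results


def version_lines(log_text):
    """Return log lines mentioning 'RCCL version', to compare against your commit hash."""
    return [line for line in log_text.splitlines() if "RCCL version" in line]
```

To use it, pass `env=debug_env()` to `subprocess.run` when launching all_reduce_perf, capture the output, and feed it to `parse_tuning` and `version_lines`. If `parse_tuning` returns nothing while `version_lines` shows a hash that differs from your commit, the stock library is probably still being loaded.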

Apr 11 '24 14:04 wenkaidu

@tks2004 - Did you get a chance to implement @wenkaidu's suggestions?

Apr 29 '24 18:04 haripriya-amd

@haripriya-amd, rccl-tests are failing if CUSTOM_RCCL_LIB is used.

Apr 30 '24 14:04 tks2004

Can you share the failure message?

May 01 '24 00:05 wenkaidu

I was able to work around this issue. How is MSCCL selected? I no longer see it being selected.

May 08 '24 13:05 tks2004

MSCCL is only supported on very limited hardware platforms. It also only supports limited collectives and data sizes. It will be automatically activated when possible.

May 08 '24 14:05 wenkaidu

Is there an option to force the use of MSCCL algorithms?

May 08 '24 16:05 tks2004

Can you share more details on the targeted hardware platform and application? The MSCCL algorithm only works well when the GPUs have all-to-all connectivity. Currently it only supports small data sizes in one-GPU-per-process mode.

May 08 '24 16:05 wenkaidu

This is on MI250X (4 GPUs, 8 GCDs) using the Slingshot interconnect. I used the latest rccl, rccl-tests and aws-ofi-rccl from the master branch.

May 08 '24 16:05 tks2004

MSCCL can be force-enabled with RCCL_MSCCL_FORCE_ENABLE=1: https://github.com/ROCm/rccl/blob/develop/src/misc/msccl/msccl_lifecycle.cc#L26 This will pick up the XML files for 8 GPUs. You may need to adjust the min/max bytes to see which data sizes it helps with.
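
As a rough sketch of the suggestion above, the snippet below sets RCCL_MSCCL_FORCE_ENABLE=1 in an environment copy and lists the attributes on the root element of each XML file in an algorithm directory, so the byte limits can be located before editing. It assumes the limits are stored as attributes on the root element, which this thread does not confirm, so it prints whatever attributes exist rather than relying on specific names. The helper names are hypothetical.

```python
import os
import xml.etree.ElementTree as ET
from pathlib import Path


def msccl_env(base=None):
    """Return a copy of the environment with MSCCL force-enabled."""
    env = dict(os.environ if base is None else base)
    env["RCCL_MSCCL_FORCE_ENABLE"] = "1"
    return env


def list_msccl_algorithms(directory):
    """Return (filename, root-element attributes) for each parseable .xml file.

    ASSUMPTION: size limits live in root-element attributes; no attribute
    names are hard-coded here, so inspect the returned dicts to find them.
    """
    results = []
    for path in sorted(Path(directory).glob("*.xml")):
        try:
            root = ET.parse(path).getroot()
        except ET.ParseError:
            continue  # skip files that are not well-formed XML
        results.append((path.name, dict(root.attrib)))
    return results
```

For example, `list_msccl_algorithms("/opt/rocm-6.0.0/share/rccl/msccl-algorithms")` would inspect the directory reported in the log earlier in this thread. Adjust the path for a custom build.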

May 08 '24 17:05 wenkaidu

Please use the develop branches of rccl and rccl-tests, not master, which may not be updated promptly.

May 08 '24 17:05 wenkaidu

RCCL_MSCCL_FORCE_ENABLE=1 did not force the MSCCL algorithm.

May 22 '24 10:05 tks2004

Was able to get the protocol and algorithm info.

Jun 10 '24 06:06 tks2004