[Issue]: Algorithm and protocol selection

Open tks2004 opened this issue 1 year ago • 14 comments

Problem Description

I want to check the algorithm and protocol being selected for all_reduce_perf. I uncommented line 1520 in enqueue.cc, but no print statements were seen. How do we check the algorithm and protocol selection?

Operating System

SLES15.4

CPU

AMD EPYC 7A53

GPU

AMD Instinct MI250

ROCm Version

ROCm 6.0.0

ROCm Component

No response

Steps to Reproduce

No response

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response

Apr 10 '24 12:04 tks2004

Tried modifying enqueue.cc (at line 1273) as well. Did not get the algorithm or communication protocol.

Apr 10 '24 16:04 tks2004

How many logical GPUs are on a single node, 8 or 16? If there are 16 GPUs, it is possible that the MSCCL path is taken.

Apr 10 '24 21:04 wenkaidu

It is 4 GPUs and was run using all 8 GCDs. Message sizes are from 2MB to 2GB. Log output:

NCCL INFO Connected all trees
NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 256 | 256
NCCL INFO 8 coll channels, 0 nvls channels, 8 p2p channels, 1 p2p channels per peer
NCCL INFO MSCCL: No external scheduler found, using internal implementation
NCCL INFO Using MSCCL files from /opt/rocm-6.0.0/lib/../share/rccl/msccl-algorithms
NCCL INFO MSCCL: Initialization finished, localSize 400

Apr 11 '24 10:04 tks2004

There are 2 possibilities:

  1. NCCL_DEBUG_SUBSYS=INIT,TUNING is missing from the environment.
  2. The default RCCL build from the ROCm release is being used instead of your modified one. Please commit your change and rebuild, then confirm that the commit hash matches the "RCCL version" line in the log.
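
Both checks can be scripted. The following is a minimal Python sketch, not an official RCCL tool. The helper names `tuning_env` and `log_reports_commit` are hypothetical. It assumes NCCL_DEBUG=INFO is set alongside NCCL_DEBUG_SUBSYS, that a locally built library is picked up through LD_LIBRARY_PATH, and that the commit hash (or the prefix you pass in) appears verbatim on the "RCCL version" log line.

```python
import os


def tuning_env(base_env=None, custom_lib_dir=None):
    """Return a copy of the environment with RCCL INIT and TUNING logs enabled.

    custom_lib_dir is a hypothetical path to a locally built RCCL; when given,
    it is prepended to LD_LIBRARY_PATH so it is found before the ROCm release build.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["NCCL_DEBUG"] = "INFO"
    env["NCCL_DEBUG_SUBSYS"] = "INIT,TUNING"
    if custom_lib_dir:
        old = env.get("LD_LIBRARY_PATH", "")
        env["LD_LIBRARY_PATH"] = custom_lib_dir + (os.pathsep + old if old else "")
    return env


def log_reports_commit(log_text, commit_hash):
    """Return True if a log line mentioning "RCCL version" also contains commit_hash."""
    return any(
        "RCCL version" in line and commit_hash in line
        for line in log_text.splitlines()
    )
```

The returned dictionary can be passed as `env=` to `subprocess.run` when launching all_reduce_perf, and the captured output can then be handed to `log_reports_commit` together with the output of `git rev-parse --short HEAD` from the RCCL checkout.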

Apr 11 '24 14:04 wenkaidu

@tks2004 - Did you get a chance to implement @wenkaidu's suggestions?

Apr 29 '24 18:04 haripriya-amd

@haripriya-amd, rccl-tests are failing if CUSTOM_RCCL_LIB is used.

Apr 30 '24 14:04 tks2004

Can you share the failure message?

May 01 '24 00:05 wenkaidu

Was able to work around this issue. How is MSCCL selected? I no longer see it being selected.

May 08 '24 13:05 tks2004

MSCCL is supported only on a very limited set of hardware platforms, and only for a limited set of collectives and data sizes. It will be automatically activated when possible.

May 08 '24 14:05 wenkaidu

Is there an option to force the use of MSCCL algorithms?

May 08 '24 16:05 tks2004

Can you share more details on the targeted hardware platform and application? MSCCL algorithms only work well when the GPUs have all-to-all connectivity. Currently MSCCL only supports small data sizes in one-GPU-per-process mode.

May 08 '24 16:05 wenkaidu

This is on MI250x (4 GPUs; 8 GCDs) using the Slingshot interconnect. I used the latest rccl, rccl-tests and aws-ofi-rccl from the master branch.

May 08 '24 16:05 tks2004

MSCCL can be force-enabled with RCCL_MSCCL_FORCE_ENABLE=1: https://github.com/ROCm/rccl/blob/develop/src/misc/msccl/msccl_lifecycle.cc#L26 This will pick up the XMLs for 8 GPUs. You may need to adjust the min/max bytes to see which data sizes it helps with.
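
As a rough illustration of trying this setting, here is a minimal Python sketch. The helper names `msccl_forced_env` and `msccl_log_lines` are hypothetical, and enabling NCCL_DEBUG=INFO is an assumption made so that the MSCCL messages show up in the output. The filter simply keeps log lines that mention MSCCL, like the ones posted earlier in this thread. Seeing them only shows that MSCCL was initialized, not that an MSCCL algorithm ran for a particular message size.

```python
import os


def msccl_forced_env(base_env=None):
    """Return a copy of the environment with MSCCL force-enabled and INFO logging on."""
    env = dict(os.environ if base_env is None else base_env)
    env["RCCL_MSCCL_FORCE_ENABLE"] = "1"
    env["NCCL_DEBUG"] = "INFO"  # assumption: INFO level is needed to see MSCCL messages
    return env


def msccl_log_lines(log_text):
    """Return the log lines that mention MSCCL, stripped of surrounding whitespace."""
    return [line.strip() for line in log_text.splitlines() if "MSCCL" in line]
```

The environment can be passed as `env=` to `subprocess.run` when launching the benchmark, and the captured output filtered with `msccl_log_lines` to compare runs with and without the variable.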

May 08 '24 17:05 wenkaidu

Please use the develop branches of rccl and rccl-tests, not master, which may not be updated promptly.

May 08 '24 17:05 wenkaidu

RCCL_MSCCL_FORCE_ENABLE=1 did not force the MSCCL algorithm.

May 22 '24 10:05 tks2004

Was able to get the protocol and algorithm info.

Jun 10 '24 06:06 tks2004