Wenkai Du
How many logical GPUs are on a single node, 8 or 16? If there are 16 GPUs, the MSCCL path may be taken.
There are two possibilities: 1. NCCL_DEBUG_SUBSYS=INIT,TUNING is missing from the environment; 2. The default RCCL build from the ROCm release is being used. Please commit your change and build, then confirm commit hash matching...
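To rule out the first possibility quickly, a small check like the following could be run in the same shell as the job. This is only a sketch: the helper name `missing_subsystems` is hypothetical, and it does not handle NCCL's `^` exclusion syntax.

```python
import os


def missing_subsystems(env=None, required=("INIT", "TUNING")):
    """Return the required NCCL_DEBUG_SUBSYS entries that are not set.

    Hypothetical helper for illustration; not part of RCCL or rccl-tests.
    """
    env = os.environ if env is None else env
    raw = env.get("NCCL_DEBUG_SUBSYS", "")
    # The variable is a comma-separated list such as "INIT,TUNING".
    enabled = {item.strip().upper() for item in raw.split(",") if item.strip()}
    if "ALL" in enabled:
        return []
    return [name for name in required if name not in enabled]


if __name__ == "__main__":
    missing = missing_subsystems()
    if missing:
        print("NCCL_DEBUG_SUBSYS is missing:", ",".join(missing))
    else:
        print("NCCL_DEBUG_SUBSYS includes INIT and TUNING")
```

Note that the subsystem filter only has an effect when NCCL_DEBUG itself is set to a level that emits logs, such as INFO.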
Can you share the failure message?
MSCCL is only supported on very limited hardware platforms. It also only supports limited collectives and data sizes. It will be automatically activated when possible.
Can you share more details on the targeted hardware platform and application? The MSCCL algorithm only works well when the GPUs have all-to-all connectivity. Currently it only supports small data sizes...
MSCCL can be force-enabled with RCCL_MSCCL_FORCE_ENABLE=1: https://github.com/ROCm/rccl/blob/develop/src/misc/msccl/msccl_lifecycle.cc#L26 This will pick up the XMLs for 8 GPUs. You may need to adjust min/max bytes to see which data sizes it helps with.
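As a rough sketch of how to try this, the snippet below prepares an environment with RCCL_MSCCL_FORCE_ENABLE=1 and a command line that sweeps message sizes with `all_reduce_perf` from rccl-tests. The binary path, the size range, and the helper name `msccl_sweep_command` are assumptions for illustration. The sweep only shows which message sizes are affected; it does not change the min/max bytes settings themselves.

```python
import os


def msccl_sweep_command(binary="./build/all_reduce_perf",
                        min_bytes="8", max_bytes="1M", gpus=8):
    """Build (env, cmd) for an all-reduce size sweep with MSCCL forced on.

    Hypothetical helper; the binary path and size range are assumptions.
    """
    env = dict(os.environ)
    env["RCCL_MSCCL_FORCE_ENABLE"] = "1"      # force the MSCCL path
    env["NCCL_DEBUG"] = "INFO"                # emit init/tuning logs
    env["NCCL_DEBUG_SUBSYS"] = "INIT,TUNING"
    # -b/-e: min/max message size, -f: size multiplier per step, -g: GPUs
    cmd = [binary, "-b", min_bytes, "-e", max_bytes, "-f", "2", "-g", str(gpus)]
    return env, cmd


if __name__ == "__main__":
    env, cmd = msccl_sweep_command()
    print("RCCL_MSCCL_FORCE_ENABLE=1", " ".join(cmd))
```

To actually launch the sweep, pass the result to `subprocess.run(cmd, env=env, check=True)` and compare the numbers against a run without the variable set.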
Please use the develop branches of rccl and rccl-tests, not master, which may not be updated promptly.
> Can we put this PR on hold as current change doesn't produce gain. We may want to revisit later while we working on the AMD GDA item

This PR...
The title itself needs clarification: this will not mitigate any delays caused by the preceding kernel. It only makes the delays more visible. This may deflect issues from RCCL, but...
> > The title itself needs clarification: this will not mitigate any delays caused by preceding kernel. It only exposes the delays to more visible. This may deflect issues from...