
[gfx950] Turn On Single Node One Slice Optimization for gfx950 and MI300A

Open alex-breslow-amd opened this issue 4 months ago • 0 comments

Details

Work item: LWPCLPAT-615

What were the changes?
Turn on the single-node one-slice optimization for MI300A and gfx950.

Why were the changes made?
To improve performance on single-node systems.

How was the outcome achieved?
RCCL uses slices to hide latency through pipelining. On some single-node systems this pipelining isn't necessary, and it's actually better to turn it off and use a single slice. Without this optimization, the code switches from one slice to two at around 14 MiB, at least on MI300X.
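
As a rough illustration of the behavior described above (not the actual RCCL code), the selection logic could look something like the sketch below. Everything named here is hypothetical: `Arch`, `kTwoSliceThresholdBytes`, `useOneSliceOnSingleNode`, and `chooseSliceCount` are made up for this example, the 14 MiB threshold is just the approximate figure observed on MI300X, and falling back to the size-based rule on multi-node runs is an assumption.

```cpp
#include <cstddef>

// Hypothetical architecture tags for illustration; not RCCL identifiers.
enum class Arch { Gfx950, MI300A, MI300X, Other };

// Assumption: approximate size at which the existing code was seen to move
// from one slice to two on MI300X ("right around 14 MiB").
constexpr std::size_t kTwoSliceThresholdBytes = 14ull * 1024 * 1024;

// Architectures for which this change turns the one-slice optimization on.
constexpr bool useOneSliceOnSingleNode(Arch arch) {
  return arch == Arch::Gfx950 || arch == Arch::MI300A;
}

// Hypothetical helper: returns the number of slices to use.
// - Single node on an opted-in architecture: always one slice.
// - Otherwise (assumed fallback): one slice below the threshold, two at or above it.
constexpr int chooseSliceCount(std::size_t bytes, bool singleNode, Arch arch) {
  return (singleNode && useOneSliceOnSingleNode(arch))
             ? 1
             : (bytes >= kTwoSliceThresholdBytes ? 2 : 1);
}
```

With this sketch, a 32 MiB message on a single gfx950 node stays at one slice, while the same message on an architecture that is not opted in would use two.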

Additional Documentation:
This may be extended to AllToAll; testing is still in progress.

Approval Checklist

Do not approve until these items are satisfied.

  • [ ] Verify the CHANGELOG has been updated, if
    • there are any NCCL API version changes,
    • any changes impact library users, and/or
    • any changes impact any other ROCm library.

alex-breslow-amd, Oct 28 '25 23:10