Adding a barrier feature to highlight delays from preceding kernels in RCCL calls

Open mberenjk opened this issue 5 months ago • 3 comments

Details

Do not mention proprietary info or link to internal work items in this PR.

**Work item:** LWPCOMMLIBS-519

What were the changes?
This commit implements a barrier that exposes delays introduced by the preceding kernel. It is currently supported only on single-node setups. The barrier can be enabled by setting RCCL_INSERT_BARRIER=1.

Why were the changes made?
The barrier captures delays caused by preceding kernels, which otherwise manifest as latency in NCCL calls.
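
To illustrate what such a barrier separates, here is a minimal toy model in Python. It is not RCCL code and says nothing about how the PR implements the barrier. It assumes that each rank enters the collective when its preceding kernel finishes, that the collective cannot complete until the slowest rank arrives, and that the collective itself takes a fixed time once all ranks are present. The names `observed_times`, `arrival_us`, and `collective_us` are hypothetical.

```python
def observed_times(arrival_us, collective_us, insert_barrier):
    """Toy model: per-rank (barrier_wait_us, collective_time_us).

    arrival_us     -- time at which each rank's preceding kernel finishes (hypothetical input)
    collective_us  -- assumed cost of the collective once every rank has arrived
    insert_barrier -- if True, waiting for slow ranks is attributed to the barrier
                      instead of being folded into the collective's measured time
    """
    last_arrival = max(arrival_us)
    results = []
    for t in arrival_us:
        wait = last_arrival - t  # idle time until the slowest rank shows up
        if insert_barrier:
            # The wait is reported separately; the collective shows its own cost only.
            results.append((wait, collective_us))
        else:
            # The wait is hidden inside the collective's measured duration.
            results.append((0, wait + collective_us))
    return results


# Rank 3's preceding kernel finishes 300 us later than the others.
arrivals = [100, 100, 100, 400]
without_barrier = observed_times(arrivals, 50, insert_barrier=False)
with_barrier = observed_times(arrivals, 50, insert_barrier=True)
print(without_barrier)
print(with_barrier)
```

In this model the fast ranks report a 350 us collective without the barrier, and a 300 us barrier wait plus a 50 us collective with it, which is the attribution the feature is meant to make visible.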

How was the outcome achieved?

Additional Documentation:

Approval Checklist

Do not approve until these items are satisfied.

  • [ ] Verify the CHANGELOG has been updated, if
    • there are any NCCL API version changes,
    • any changes impact library users, and/or
    • any changes impact any other ROCm library.

mberenjk • Sep 26 '25 00:09

The title itself needs clarification: this will not mitigate any delays caused by the preceding kernel. It only makes the delays more visible. This may deflect issues away from RCCL, but at the end of the day, users need a root cause and a fix for the issue. Only saying that this is not caused by RCCL is not very helpful. A better method is to capture rocprof traces and understand which preceding kernels are causing the delay.

wenkaidu • Oct 01 '25 22:10

> The title itself needs clarification: this will not mitigate any delays caused by the preceding kernel. It only makes the delays more visible. This may deflect issues away from RCCL, but at the end of the day, users need a root cause and a fix for the issue. Only saying that this is not caused by RCCL is not very helpful. A better method is to capture rocprof traces and understand which preceding kernels are causing the delay.

Yeah, I think the intent here is that this feature reduces the number of tickets that get filed against RCCL and provides a quick way for calculating what percentage speedup one can actually get from optimizing RCCL (faster ticket resolution). But I generally agree that analysis of torch profiler traces can get you similar information if not more.
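
A rough sketch of that percentage calculation, in Python, might look like the following. It is an Amdahl-style upper bound, not anything taken from the PR: it assumes the RCCL time measured after the barrier does not overlap with compute, and the function names and example numbers are hypothetical.

```python
def max_speedup_from_rccl(step_time_us, rccl_time_us):
    """Upper bound on end-to-end speedup if RCCL time were reduced to zero.

    step_time_us -- measured duration of one full step (hypothetical input)
    rccl_time_us -- time spent inside RCCL calls after the barrier has removed
                    waiting on preceding kernels (hypothetical input)
    Assumes communication and compute do not overlap.
    """
    if not 0 <= rccl_time_us < step_time_us:
        raise ValueError("rccl_time_us must be in [0, step_time_us)")
    return step_time_us / (step_time_us - rccl_time_us)


def max_speedup_percent(step_time_us, rccl_time_us):
    """Same bound expressed as a percentage improvement."""
    return (max_speedup_from_rccl(step_time_us, rccl_time_us) - 1.0) * 100.0


# Example: a 1000 us step of which 200 us is genuinely spent in RCCL.
bound = max_speedup_percent(1000, 200)
print(f"At most {bound:.1f}% faster even with a zero-cost RCCL")
```

If communication overlaps with compute, the achievable gain is smaller than this bound.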

alex-breslow-amd • Oct 01 '25 22:10

> > The title itself needs clarification: this will not mitigate any delays caused by the preceding kernel. It only makes the delays more visible. This may deflect issues away from RCCL, but at the end of the day, users need a root cause and a fix for the issue. Only saying that this is not caused by RCCL is not very helpful. A better method is to capture rocprof traces and understand which preceding kernels are causing the delay.
>
> Yeah, I think the intent here is that this feature reduces the number of tickets that get filed against RCCL and provides a quick way for calculating what percentage speedup one can actually get from optimizing RCCL (faster ticket resolution). But I generally agree that analysis of torch profiler traces can get you similar information if not more.

If we only need to look at variation in RCCL start times, such a feature has already been merged into develop in https://github.com/ROCm/rccl/pull/1785. That implementation doesn't make any kernel-side changes. It also logs the latency of every kernel launch, so outliers can be seen easily.

wenkaidu • Oct 01 '25 22:10