Adding a barrier feature to highlight delays from preceding kernels in RCCL calls
Details
Do not mention proprietary info or link to internal work items in this PR.
**Work item:** LWPCOMMLIBS-519
What were the changes?
This commit implements a barrier to expose delays introduced by the preceding kernel. It is currently supported only on single-node setups.
The barrier can be enabled by setting RCCL_INSERT_BARRIER=1.
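To illustrate why a barrier makes these delays visible, here is a toy timing model in Python. It is a sketch under stated assumptions, not the RCCL implementation: it assumes the collective cannot make progress until every rank has arrived and then takes a fixed time, and the function names and numbers are hypothetical.

```python
# Toy model (hypothetical names and numbers, not RCCL code).
# arrival_ms[i] is when rank i finishes its preceding kernel and
# enters the collective; collective_ms is the assumed true cost of
# the collective once all ranks are present.

def apparent_collective_times(arrival_ms, collective_ms):
    """Without a barrier, time spent waiting for the slowest rank
    is measured as part of the collective on each rank."""
    last = max(arrival_ms)
    return [last - a + collective_ms for a in arrival_ms]


def split_with_barrier(arrival_ms, collective_ms):
    """With a barrier in front of the collective, the wait is
    attributed to the barrier: returns (barrier_wait, collective)."""
    last = max(arrival_ms)
    return [(last - a, collective_ms) for a in arrival_ms]


# Rank 3 is held up by a slow preceding kernel.
arrival_ms = [10.0, 10.0, 10.0, 25.0]
collective_ms = 2.0

without_barrier = apparent_collective_times(arrival_ms, collective_ms)
with_barrier = split_with_barrier(arrival_ms, collective_ms)

for rank, (total, (wait, coll)) in enumerate(zip(without_barrier, with_barrier)):
    print(f"rank {rank}: no barrier={total} ms, barrier wait={wait} ms, collective={coll} ms")
```

In this model the three early ranks appear to spend 17 ms in the collective, while the barrier splits that into 15 ms of waiting and 2 ms of actual communication. The barrier does not remove the wait; it only changes where the wait is reported.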
Why were the changes made?
Captures delays caused by preceding kernels, which manifest in NCCL calls.
How was the outcome achieved?
Additional Documentation:
Approval Checklist
Do not approve until these items are satisfied.
- [ ] Verify the CHANGELOG has been updated, if
  - there are any NCCL API version changes,
  - any changes impact library users, and/or
  - any changes impact any other ROCm library.
The title itself needs clarification: this will not mitigate any delays caused by the preceding kernel. It only makes the delays more visible. This may deflect issues from RCCL, but at the end of the day, users need a root cause and a fix for the issue. Only saying that this is not caused by RCCL is not very helpful. A better method is to capture rocprof traces and understand which preceding kernels are causing the delay.
Yeah, I think the intent here is that this feature reduces the number of tickets filed against RCCL and provides a quick way to calculate what percentage speedup one can actually get from optimizing RCCL (faster ticket resolution). But I generally agree that analyzing torch profiler traces can give you similar information, if not more.
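As a rough sketch of the "what percentage speedup" calculation mentioned above: if the collective takes `rccl_ms` out of a `step_ms` training step and is not overlapped with compute (an assumption), then even an infinitely fast collective speeds up the step by at most `step_ms / (step_ms - rccl_ms)`. The function name and the numbers below are hypothetical.

```python
def max_speedup_from_rccl(step_ms, rccl_ms):
    """Upper bound on step speedup if the collective time dropped to
    zero, assuming the collective is not overlapped with compute."""
    if not 0 <= rccl_ms < step_ms:
        raise ValueError("expected 0 <= rccl_ms < step_ms")
    return step_ms / (step_ms - rccl_ms)


step_ms = 100.0

# Collective time as it appears when waiting on a slow preceding
# kernel is counted as part of the collective.
apparent_bound = max_speedup_from_rccl(step_ms, 17.0)

# Collective time once the wait has been separated out.
isolated_bound = max_speedup_from_rccl(step_ms, 2.0)

print(f"bound from apparent time: {apparent_bound:.3f}x")
print(f"bound from isolated time: {isolated_bound:.3f}x")
```

The gap between the two bounds is the part of the apparent RCCL cost that optimizing RCCL cannot recover.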
If we only need to look at RCCL start-time variation, such a feature has already been merged into develop in https://github.com/ROCm/rccl/pull/1785. That implementation doesn't make any kernel-side changes. It also logs the latency of every kernel launch, so outliers are easy to spot.