Allow zero byte sendrecv in alltoallv

Open wenkaidu opened this issue 1 year ago • 0 comments

Details

Do not mention proprietary info or link to internal work items in this PR.

Work item: "Internal", or link to GitHub issue (if applicable). Internal

What were the changes?
Allow zero byte sendrecv in alltoallv

Why were the changes made?
From PyTorch code: https://github.com/NVIDIA/nccl/issues/696. Skipping zero-byte send/recv can cause a deadlock: a rank that sends and receives 0 bytes skips the collective entirely, which causes a mismatch across ranks.
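
The mismatch described above can be illustrated with a small host-only model. This is a sketch under stated assumptions, not the RCCL implementation: `count_issued_ops` and its `skip_zero` flag are hypothetical names, and the model assumes alltoallv is built from one send and one recv per peer, with the old behavior dropping any operation whose size is zero.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical model of one rank's alltoallv schedule; NOT the actual RCCL code.
// Assumption: alltoallv issues one send and one recv per peer.
// skip_zero = true  models the old behavior (zero-byte operations are dropped).
// skip_zero = false models the behavior this change allows (they are issued).
std::size_t count_issued_ops(const std::vector<std::size_t>& send_bytes,
                             const std::vector<std::size_t>& recv_bytes,
                             bool skip_zero) {
    std::size_t issued = 0;
    for (std::size_t peer = 0; peer < send_bytes.size(); ++peer) {
        // Send to this peer unless it is a zero-byte send that is being skipped.
        if (!(skip_zero && send_bytes[peer] == 0)) ++issued;
        // Receive from this peer unless it is a zero-byte recv that is being skipped.
        if (!(skip_zero && recv_bytes[peer] == 0)) ++issued;
    }
    return issued;
}
```

In this model, a rank whose counts are all zero issues no operations at all when `skip_zero` is true, so it never enters the collective while its peers do. With `skip_zero` set to false, every rank issues the same number of operations regardless of its counts.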

How was the outcome achieved?
Allow zero byte sendrecv in alltoallv

Additional Documentation:
What else should the reviewer know?

Approval Checklist

Do not approve until these items are satisfied.

[ ] Verify the CHANGELOG has been updated, if
- there are any NCCL API version changes,
- any changes impact library users, and/or
- any changes impact any other ROCm library.

Sep 25 '24 17:09 wenkaidu