rccl
Merging NCCL 2.21.5-1
Details
Do not mention proprietary info or link to internal work items in this PR.
Work item: "Internal", or link to GitHub issue (if applicable).
What were the changes?
Merged NCCL v2.21.5-1 into RCCL develop.
Why were the changes made?
RCCL must remain up to date with the latest NCCL releases.
How was the outcome achieved?
Performed a git merge, then resolved the merge conflicts, the resulting compiler errors, and the logical conflicts caused by functional changes in NCCL.
Additional Documentation:
NCCL Documentation:
- Add support for IB SHARP 1PPN operation with user buffers.
- Improve support for MNNVL, add NVLS support and multi-clique support.
  - Detect the NVLS clique through NVML
  - Exchange XML between peers in the same NVLS clique and fuse XMLs before creating the topology graph.
  - Rework bootstrap allgather algorithms to allow for large allgather operations intra-node (XML exchange).
- Net/IB: add support for dynamic GID detection.
  - Automatically select RoCEv2/IPv4 interface by default. Allow to select IPv6 or even the network/mask.
- Reduce NVLS memory usage.
  - Add stepSize as property of a connection to allow for different sizes on different peers; set it to 128K for NVLink SHARP.
- Improve tuner loading
  - Look for more paths, be more consistent with the network device plugin.
  - Also search for tuner support inside the net plugin.
- Improve tuner API
  - Add context to support multi-device per process.
- Add magic number around comm object to detect comm corruption.
  - Add some basic check around communicators so that we can report a problem when a communicator gets corrupted or a wrong comm pointer is passed to NCCL.
- Fix net/IB error path. https://github.com/NVIDIA/nccl/pull/1164
- Fix collnet rail mapping with split comm.
- Fix packet reordering issue causing bootstrap mismatch
  - Use a different tag in ncclTransportP2pSetup for the connectInfo exchange and the following barrier.
- Fix hang when crossNic is inconsistent between ranks.
- Fix minCompCap/maxCompCap computation. https://github.com/NVIDIA/nccl/issues/1184
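
For reviewers unfamiliar with the "magic number around comm object" item in the notes above, here is a minimal sketch of the general technique. It is not the NCCL or RCCL implementation: `DemoComm`, `checkComm`, `CommStatus`, and the value of `kCommMagic` are hypothetical names chosen for illustration. The idea is to place a known sentinel at the start and end of the object and verify both before using the pointer.

```cpp
#include <cstdint>

// Hypothetical sentinel value, chosen only for this sketch.
constexpr std::uint64_t kCommMagic = 0x1234ABCD5678EF90ULL;

enum class CommStatus { Ok, NullPointer, Corrupted };

// Hypothetical communicator layout: sentinels bracket the real fields, so an
// overrun from either side, or a pointer to some unrelated object, is likely
// to leave at least one sentinel with the wrong value.
struct DemoComm {
  std::uint64_t startMagic = kCommMagic;  // first member
  int rank = 0;
  int nRanks = 1;
  std::uint64_t endMagic = kCommMagic;    // last member
};

// Basic validity check to run at API entry points before touching the object.
CommStatus checkComm(const DemoComm* comm) {
  if (comm == nullptr) return CommStatus::NullPointer;
  if (comm->startMagic != kCommMagic || comm->endMagic != kCommMagic) {
    return CommStatus::Corrupted;
  }
  return CommStatus::Ok;
}
```

A check like this cannot prove that a communicator is intact, but it turns many stray-pointer and buffer-overrun bugs into a reportable error instead of undefined behavior deeper in the library.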
Approval Checklist
Do not approve until these items are satisfied.
- [ ] Verify the CHANGELOG has been updated, if
  - there are any NCCL API version changes,
  - any changes impact library users, and/or
  - any changes impact any other ROCm library.