
Merging NCCL 2.21.5-1

Open corey-derochie-amd opened this issue 1 year ago • 0 comments

Details

Do not mention proprietary info or link to internal work items in this PR.

Work item: "Internal", or link to GitHub issue (if applicable).

What were the changes?
Merged NCCL v2.21.5-1 into RCCL develop.

Why were the changes made?
RCCL must remain up to date with the latest NCCL versions.

How was the outcome achieved?
Performed a git merge, then resolved the merge conflicts, the resulting compiler errors, and the logical conflicts caused by functional changes in NCCL.

Additional Documentation:
NCCL Documentation:

  • Add support for IB SHARP 1PPN operation with user buffers.
  • Improve support for MNNVL, add NVLS support and multi-clique support.
    • Detect the NVLS clique through NVML.
    • Exchange XML between peers in the same NVLS clique and fuse XMLs before creating the topology graph.
    • Rework bootstrap allgather algorithms to allow for large allgather operations intra-node (XML exchange).
  • Net/IB: add support for dynamic GID detection.
    • Automatically select RoCEv2/IPv4 interface by default. Allow to select IPv6 or even the network/mask.
  • Reduce NVLS memory usage.
    • Add stepSize as property of a connection to allow for different sizes on different peers; set it to 128K for NVLink SHARP.
  • Improve tuner loading.
    • Look for more paths, be more consistent with the network device plugin.
    • Also search for tuner support inside the net plugin.
  • Improve tuner API.
    • Add context to support multi-device per process.
  • Add magic number around comm object to detect comm corruption.
    • Add some basic check around communicators so that we can report a problem when a communicator gets corrupted or a wrong comm pointer is passed to NCCL.
  • Fix net/IB error path. https://github.com/NVIDIA/nccl/pull/1164
  • Fix collnet rail mapping with split comm.
  • Fix packet reordering issue causing bootstrap mismatch.
    • Use a different tag in ncclTransportP2pSetup for the connectInfo exchange and the following barrier.
  • Fix hang when crossNic is inconsistent between ranks.
  • Fix minCompCap/maxCompCap computation. https://github.com/NVIDIA/nccl/issues/1184
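
For reviewers unfamiliar with the "magic number around comm object" item above, here is a minimal sketch of the general technique in C. It is not the NCCL or RCCL implementation: the struct layout, the `demo_` names, and the sentinel value are all assumptions chosen for illustration. The idea is to place a known constant in the first and last fields of the object and verify both before use, so that a corrupted communicator or an unrelated pointer is reported instead of silently dereferenced.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sentinel value; the constant used by the real library may differ. */
#define DEMO_COMM_MAGIC 0x0280028002800280ULL

/* Hypothetical communicator object, guarded by a sentinel at each end. */
typedef struct demo_comm {
    uint64_t start_magic; /* first field: catches wrong pointers and writes before the object */
    int rank;
    int n_ranks;
    uint64_t end_magic;   /* last field: catches overruns that clobber the tail of the object */
} demo_comm_t;

typedef enum { DEMO_OK = 0, DEMO_INVALID_ARGUMENT = 1 } demo_result_t;

/* Fill in the payload and stamp both sentinels. */
static void demo_comm_init(demo_comm_t *comm, int rank, int n_ranks) {
    comm->start_magic = DEMO_COMM_MAGIC;
    comm->rank = rank;
    comm->n_ranks = n_ranks;
    comm->end_magic = DEMO_COMM_MAGIC;
}

/* Return DEMO_OK only if the pointer is non-NULL and both sentinels are intact. */
static demo_result_t demo_comm_check(const demo_comm_t *comm) {
    if (comm == NULL) return DEMO_INVALID_ARGUMENT;
    if (comm->start_magic != DEMO_COMM_MAGIC || comm->end_magic != DEMO_COMM_MAGIC) {
        return DEMO_INVALID_ARGUMENT;
    }
    return DEMO_OK;
}
```

In this sketch, every public entry point would call `demo_comm_check` first and return an error if it fails. The check is cheap (two 64-bit comparisons), but it only detects corruption that touches the sentinel fields, so it is a basic sanity check rather than a full integrity guarantee.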

Approval Checklist

Do not approve until these items are satisfied.

  • [ ] Verify the CHANGELOG has been updated, if
    • there are any NCCL API version changes,
    • any changes impact library users, and/or
    • any changes impact any other ROCm library.

corey-derochie-amd, Jul 24 '24 03:07