[WIP] change to NCCL_CROSS_NIC=2 to allow for alternating ring algo, and update to nccl==2.23.4
- [x] change to `NCCL_CROSS_NIC=2`
- [x] update from very old `nccl==2.19.4` in ngc 24.01 to `nccl==2.23.4` in ngc 24.12
- [x] change to `QPS_PER_CONNECTION=1` when within the same rail group, as there are no hash collisions within the same rail group
- [ ] TODO: add note about needing more QPs when above 1 tier of switching to increase entropy
- [x] remove nccl topo since NCCL graph search should be able to auto generate the topo on OCI's bare metal instances
- [x] remove NCCL_NET_PLUGIN=none
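
The checklist above can be sketched as a small environment transform. This is an illustration only, not the PR's actual diff: `apply_nccl_settings` is a hypothetical helper, and it assumes that `QPS_PER_CONNECTION` in the checklist refers to NCCL's `NCCL_IB_QPS_PER_CONNECTION` variable and that the removed topology was supplied through `NCCL_TOPO_FILE`.

```python
import os


def apply_nccl_settings(env, same_rail_group=True):
    """Return a copy of `env` with the checklist's NCCL settings applied.

    Hypothetical helper for illustration; not part of the PR.
    """
    new_env = dict(env)

    # Allow rings to cross NICs where NCCL finds it helpful (alternating rings).
    new_env["NCCL_CROSS_NIC"] = "2"

    if same_rail_group:
        # Assumption: the PR's QPS_PER_CONNECTION maps to this NCCL variable.
        # One QP per connection is enough when traffic stays in one rail group.
        new_env["NCCL_IB_QPS_PER_CONNECTION"] = "1"

    # Settings the PR removes: let NCCL pick its default network plugin and
    # let its graph search detect the topology (assumed to have been set
    # through NCCL_TOPO_FILE).
    new_env.pop("NCCL_NET_PLUGIN", None)
    new_env.pop("NCCL_TOPO_FILE", None)
    return new_env


if __name__ == "__main__":
    updated = apply_nccl_settings(os.environ)
    for key in sorted(updated):
        if key.startswith("NCCL_"):
            print(f"{key}={updated[key]}")
```

Passing `same_rail_group=False` leaves the QP count untouched, which matches the open TODO: the right value for traffic that crosses more than one switching tier is not settled in this PR.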
Max BW with the alternating ring is 390 GByte/s; without it, it is 370 GByte/s, according to Sylvain's GTC24 NCCL talk.
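
For a sense of scale, the relative gain implied by those two figures can be worked out directly. The sketch below uses only the 390 and 370 GByte/s numbers quoted above; `relative_gain` is a hypothetical helper name.

```python
def relative_gain(with_alt, without_alt):
    """Fractional bandwidth improvement of the alternating ring."""
    return (with_alt - without_alt) / without_alt


# Figures quoted above, in GByte/s.
gain = relative_gain(390.0, 370.0)
print(f"{gain:.1%}")  # prints 5.4%
```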
Thank you for your pull request and welcome to our community! To contribute, please sign the Oracle Contributor Agreement (OCA). The following contributors of this PR have not signed the OCA:
- PR author: OrenLeung
- [email protected] (@OrenLeung)
To sign the OCA, please create an Oracle account and sign the OCA in Oracle's Contributor Agreement Application.
When signing the OCA, please provide your GitHub username. After signing the OCA and getting an OCA approval from Oracle, this PR will be automatically updated.
If you are an Oracle employee, please make sure that you are a member of the main Oracle GitHub organization, and your membership in this organization is public.
Thank you for this PR. We will include this update in a future version.