[WIP] change to cross nic=2 to allow for alternating ring algo and nccl==2.23.4

Open functionstackx opened this issue 1 year ago • 2 comments

  • [x] change to NCCL_CROSS_NIC=2
  • [x] update from very old nccl==2.19.4 in ngc 24.01 to nccl==2.23.4 in ngc 24.12
  • [x] change to QPS_PER_CONNECTION=1 when traffic stays within the same rail group, as there are no hash collisions within a rail group
  • [ ] TODO: add a note that more QPs are needed above 1 tier of switching, to increase entropy
  • [x] remove nccl topo since NCCL graph search should be able to auto generate the topo on OCI's bare metal instances
  • [x] remove NCCL_NET_PLUGIN=none
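
The checklist above can be sketched as a small helper that builds the environment for an NCCL job. This is a minimal illustration, not the PR's actual diff. It assumes that `QPS_PER_CONNECTION` refers to NCCL's `NCCL_IB_QPS_PER_CONNECTION` and that the removed topology file was supplied through `NCCL_TOPO_FILE`. The function name `build_nccl_env` and the `multi_tier_qps` default of 4 are hypothetical placeholders; the PR does not say how many QPs are needed above one switching tier.

```python
import os

# Variables the PR stops setting, so NCCL falls back to its own defaults
# (automatic topology search and default network plugin lookup).
# Assumption: the topology file was passed through NCCL_TOPO_FILE.
REMOVED_VARS = ("NCCL_TOPO_FILE", "NCCL_NET_PLUGIN")


def build_nccl_env(base_env, same_rail_group=True, multi_tier_qps=4):
    """Return a copy of base_env with the NCCL settings from the checklist.

    multi_tier_qps is a hypothetical placeholder: the PR only says that
    more QPs are needed above one tier of switching, not how many.
    """
    # Drop the variables the PR removes; keep everything else untouched.
    env = {k: v for k, v in base_env.items() if k not in REMOVED_VARS}

    # NCCL_CROSS_NIC=2 lets NCCL use different NICs for the same ring
    # when that helps, which the PR relies on for alternating rings.
    env["NCCL_CROSS_NIC"] = "2"

    # One QP per connection inside a rail group; more QPs otherwise.
    qps = 1 if same_rail_group else multi_tier_qps
    env["NCCL_IB_QPS_PER_CONNECTION"] = str(qps)
    return env


if __name__ == "__main__":
    # Print the resulting NCCL settings as shell exports.
    for key, value in sorted(build_nccl_env(dict(os.environ)).items()):
        if key.startswith("NCCL_"):
            print(f"export {key}={value}")
```

Passing `same_rail_group=False` switches to the higher QP count, which is where the open TODO item would apply once a concrete value is chosen.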

According to Sylvain's GTC24 NCCL talk, the max bandwidth is 390 GB/s with alternating rings and 370 GB/s without them, a gain of roughly 5%.

functionstackx, Feb 26 '25 01:02

Thank you for your pull request and welcome to our community! To contribute, please sign the Oracle Contributor Agreement (OCA). The following contributors of this PR have not signed the OCA:

To sign the OCA, please create an Oracle account and sign the OCA in Oracle's Contributor Agreement Application.

When signing the OCA, please provide your GitHub username. After signing the OCA and getting an OCA approval from Oracle, this PR will be automatically updated.

If you are an Oracle employee, please make sure that you are a member of the main Oracle GitHub organization, and your membership in this organization is public.

Thank you for this PR. We will include this update in a future version.

arnaudfroidmont, Oct 17 '25 15:10