8-card V100 training failure
(deepspeed) [menkeyi@gpu1 DeepSpeed-Chat]$ python train.py --actor-model facebook/opt-1.3b --reward-model facebook/opt-350m --deployment-type single_node
(test) [menkeyi@workstation DeepSpeed-Chat]$ tail -f output/actor-models/1.3b/training.log
gpu1:1329:1670 [0] NCCL INFO comm 0x43f61e20 rank 0 nranks 8 cudaDev 0 busId 4f000 - Init COMPLETE
gpu1:1335:1676 [6] NCCL INFO comm 0x44ca90c0 rank 6 nranks 8 cudaDev 6 busId d5000 - Init COMPLETE
gpu1:1331:1672 [2] NCCL INFO comm 0x43da60a0 rank 2 nranks 8 cudaDev 2 busId 56000 - Init COMPLETE
gpu1:1334:1675 [5] NCCL INFO comm 0x448f1ac0 rank 5 nranks 8 cudaDev 5 busId d1000 - Init COMPLETE
gpu1:1336:1682 [7] NCCL INFO comm 0x4335e050 rank 7 nranks 8 cudaDev 7 busId d6000 - Init COMPLETE
gpu1:1333:1671 [4] NCCL INFO comm 0x43f0f350 rank 4 nranks 8 cudaDev 4 busId ce000 - Init COMPLETE
gpu1:1332:1684 [3] NCCL INFO comm 0x456c6620 rank 3 nranks 8 cudaDev 3 busId 57000 - Init COMPLETE
gpu1:1330:1678 [1] NCCL INFO comm 0x42f72dc0 rank 1 nranks 8 cudaDev 1 busId 52000 - Init COMPLETE
gpu1:1334:1679 [0] transport/net_ib.cc:93 NCCL WARN NET/IB : Got async event : client reregistration
gpu1:1329:1677 [0] transport/net_ib.cc:93 NCCL WARN NET/IB : Got async event : client reregistration
gpu1:1330:1681 [0] transport/net_ib.cc:93 NCCL WARN NET/IB : Got async event : client reregistration
gpu1:1331:1674 [0] transport/net_ib.cc:93 NCCL WARN NET/IB : Got async event : client reregistration
gpu1:1333:1673 [0] transport/net_ib.cc:93 NCCL WARN NET/IB : Got async event : client reregistration
The training log does not advance past this point. Training on a single V100 card runs normally.
Hi @menkeyi, I wasn't able to repro this bug locally on an 8x V100 single-node setup. Are you seeing this as a recurring issue, or an intermittent one?
If it is recurring, could you give more information about the single-node setup (OS, PyTorch, CUDA, and NCCL versions), as well as the NCCL info printed at the start of the training.log file?
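One way to gather the requested versions in a single step is a small script like the one below. This is a hedged sketch, not part of DeepSpeed-Chat; the `collect_env` helper is hypothetical, and the `torch` lookups are guarded so the script still runs on machines without PyTorch installed.

```python
import platform


def collect_env():
    """Collect OS, Python, PyTorch, CUDA, and NCCL version info.

    Hypothetical helper for filling out this bug report; it degrades
    gracefully when PyTorch is not installed.
    """
    info = {
        "os": platform.platform(),
        "python": platform.python_version(),
    }
    try:
        import torch  # optional dependency: only present in the training env

        info["pytorch"] = torch.__version__
        info["cuda"] = torch.version.cuda  # CUDA version PyTorch was built with
        if torch.cuda.is_available():
            # NCCL version is reported as a tuple, e.g. (2, 14, 3)
            info["nccl"] = ".".join(map(str, torch.cuda.nccl.version()))
    except ImportError:
        info["pytorch"] = "not installed"
    return info


if __name__ == "__main__":
    for key, value in collect_env().items():
        print(f"{key}: {value}")
```

Running this inside the `deepspeed` conda environment and pasting the output alongside the first few NCCL INFO lines of training.log would cover everything asked for above.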
Here is mine as an example:

Thanks.
Closing due to no response. Please re-open if needed.