jacksonlee02365894
With the same dataset, single-GPU training runs normally, but multi-GPU training fails with an error. I have already tried gradually reducing batch_size, and that did not solve the problem.

[rank0]:[E ProcessGroupNCCL.cpp:563] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3437, OpType=ALLREDUCE, NumelIn=4311252, NumelOut=4311252, Timeout(ms)=600000) ran for 600057 milliseconds before timing out.
[rank1]:[E ProcessGroupNCCL.cpp:563] [Rank 1] Watchdog caught collective...
question
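To make the quoted watchdog message easier to read, here is a minimal sketch that pulls the fields out of the rank 0 line. The helper name `parse_watchdog_line` and the regular expression are hypothetical, written only against the one complete log line above, so other PyTorch versions may format the message differently. It shows that the ALLREDUCE with SeqNum=3437 ran into the 600000 ms (10 minute) limit rather than failing immediately.

```python
import re

# Hypothetical pattern, assuming the log line has exactly the layout quoted in the issue.
WATCHDOG_RE = re.compile(
    r"\[Rank (?P<rank>\d+)\] Watchdog caught collective operation timeout: "
    r"WorkNCCL\(SeqNum=(?P<seq>\d+), OpType=(?P<op>\w+), "
    r"NumelIn=(?P<numel_in>\d+), NumelOut=(?P<numel_out>\d+), "
    r"Timeout\(ms\)=(?P<timeout_ms>\d+)\) ran for (?P<elapsed_ms>\d+) milliseconds"
)


def parse_watchdog_line(line):
    """Return the watchdog fields as a dict, or None if the line does not match."""
    match = WATCHDOG_RE.search(line)
    if match is None:
        return None
    # Every captured field is numeric except the operation type.
    info = {
        key: (value if key == "op" else int(value))
        for key, value in match.groupdict().items()
    }
    # Convert the configured timeout from milliseconds to minutes for readability.
    info["timeout_minutes"] = info["timeout_ms"] / 60000
    return info


# The complete rank 0 line from the issue, copied verbatim.
sample = (
    "[rank0]:[E ProcessGroupNCCL.cpp:563] [Rank 0] Watchdog caught collective "
    "operation timeout: WorkNCCL(SeqNum=3437, OpType=ALLREDUCE, NumelIn=4311252, "
    "NumelOut=4311252, Timeout(ms)=600000) ran for 600057 milliseconds before "
    "timing out."
)

info = parse_watchdog_line(sample)
print(info)
```

If the full logs from every rank are available, running each rank's line through the same helper and comparing the `seq` values is one possible way to check whether all ranks were waiting on the same collective. That is a suggestion, not something the truncated log above can confirm.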