
Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1809861 milliseconds before timing out

Open c6376315qqso opened this issue 3 years ago • 3 comments

This error occurs during the validation step. Has anyone else run into this problem?

c6376315qqso Jun 02 '22 17:06

My environment is CUDA 11.3, Python 3.7, and PyTorch 1.10.

c6376315qqso Jun 02 '22 17:06

I'm seeing the same problem. Have you solved it?

weiguangzhao Jul 02 '22 13:07

Does this problem occur on a custom dataset?

thangvubk Sep 05 '22 12:09