SoftGroup
Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1809861 milliseconds before timing out
This error happens during the validation step. Has anyone else run into this problem?
My environment is CUDA 11.3, Python 3.7, PyTorch 1.10.
Same here. Have you solved it?
Does this problem occur on a custom dataset?