Errors when training with dataloader_num_workers > 0

Open huangjun12 opened this issue 1 year ago • 2 comments

Error message:

RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method

May 23 '24 09:05 huangjun12

I have the same issue. Is there a guide for distributed training, or are any updates coming?

Sep 19 '24 05:09 jayhxmo

Adding torch.multiprocessing.set_start_method('spawn', force=True) at the beginning of the training script seems to be sufficient, based on some brief tests. However, I still suspect that, in theory, this could introduce bugs.
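For anyone who wants to try it, here is a minimal sketch of where that call could go. It is not taken from this repository: the helper name use_spawn_start_method and the main() layout are made up for illustration, and the stdlib fallback is only there so the snippet runs on a machine without PyTorch. The error itself typically appears when CUDA has already been initialized in the parent process and the DataLoader workers are then created by fork, which is why switching to 'spawn' helps.

```python
# Sketch only. Falls back to the standard library so it runs without PyTorch;
# torch.multiprocessing re-exports the same set/get_start_method functions.
try:
    import torch.multiprocessing as mp
except ImportError:
    import multiprocessing as mp


def use_spawn_start_method() -> str:
    """Force the 'spawn' start method and return the method now in effect.

    Hypothetical helper, not part of any library.
    """
    # force=True overrides a start method that was already set in this process.
    mp.set_start_method("spawn", force=True)
    return mp.get_start_method()


def main() -> None:
    method = use_spawn_start_method()
    print(f"multiprocessing start method: {method}")
    # Placeholder (assumption): build the dataset, DataLoader, and trainer here,
    # only after the start method has been set.


if __name__ == "__main__":
    # With 'spawn', worker processes re-import this module, so the entry point
    # must be guarded to keep the training code from running in every worker.
    main()
```

If changing the global start method feels too invasive, torch.utils.data.DataLoader also accepts a multiprocessing_context argument (for example multiprocessing_context='spawn'), which limits the change to that loader's workers. Whether the trainer used here exposes that option is something I have not checked. Note that 'spawn' requires the dataset and collate function to be picklable, and worker startup is usually slower than with fork.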

May 20 '25 03:05 santisy