CUDA Error when loading checkpoint on more than one GPU
Hello.
I am running into an issue when using your code. If I try to resume training from a checkpoint on more than one GPU (I am using Docker containers), I get the following error:
File "__main__.py", line 55, in <module>
main(parser.parse_args())
File "__main__.py", line 39, in main
spawn(train_distributed, args=(replica_count, port, args, params), nprocs=replica_count, join=True)
File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 200, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
while not context.join():
File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 119, in join
raise Exception(msg)
Exception:
-- Process 1 terminated with the following error:
Traceback (most recent call last):
File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
fn(i, *args)
File "/opt/diffwave/src/diffwave/learner.py", line 188, in train_distributed
_train_impl(replica_id, model, dataset, args, params)
File "/opt/diffwave/src/diffwave/learner.py", line 163, in _train_impl
learner.restore_from_checkpoint()
File "/opt/diffwave/src/diffwave/learner.py", line 95, in restore_from_checkpoint
checkpoint = torch.load(f'{self.model_dir}/{filename}.pt')
File "/opt/conda/lib/python3.7/site-packages/torch/serialization.py", line 584, in load
return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
File "/opt/conda/lib/python3.7/site-packages/torch/serialization.py", line 842, in _load
result = unpickler.load()
File "/opt/conda/lib/python3.7/site-packages/torch/serialization.py", line 834, in persistent_load
load_tensor(data_type, size, key, _maybe_decode_ascii(location))
File "/opt/conda/lib/python3.7/site-packages/torch/serialization.py", line 823, in load_tensor
loaded_storages[key] = restore_location(storage, location)
File "/opt/conda/lib/python3.7/site-packages/torch/serialization.py", line 174, in default_restore_location
result = fn(storage, location)
File "/opt/conda/lib/python3.7/site-packages/torch/serialization.py", line 156, in _cuda_deserialize
return obj.cuda(device)
File "/opt/conda/lib/python3.7/site-packages/torch/_utils.py", line 77, in _cuda
return new_type(self.size()).copy_(self, non_blocking)
File "/opt/conda/lib/python3.7/site-packages/torch/cuda/__init__.py", line 480, in _lazy_new
return super(_CudaBase, cls).__new__(cls, *args, **kwargs)
RuntimeError: CUDA error: all CUDA-capable devices are busy or unavailable
When starting from scratch, or when using a single GPU, the error does not appear and training runs flawlessly.
I should add that I checked that the GPUs were completely free when launching the training.
Any advice on this issue?
Thanks in advance.
Hmm, I haven't run across that error before. Sorry, I don't think I'll be of much help here.
It somehow spawns multiple processes on a single GPU, but only on one of the GPUs... I am launching the training on 4 GPUs. Three of the GPUs each get a single process, but one of them ends up with 4 processes. With 3 GPUs the same thing happens: one of the GPUs ends up with 3 processes. I cannot find a bug in the code that forces this behaviour...
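That pile-up of processes on one GPU looks consistent with torch.load's default device mapping: CUDA tensors are restored onto the device they were saved from (usually cuda:0), so every replica calling restore_from_checkpoint may try to allocate on that same GPU. Below is a minimal sketch of loading with map_location as a possible workaround; the function name and the replica_id argument are illustrative assumptions, not code from this repo:

```python
import torch

# Hypothetical sketch, not the repo's actual code: `model_dir`, `filename`, and
# `replica_id` are illustrative stand-ins for the values used in learner.py.
def load_checkpoint_for_replica(model_dir, filename, replica_id):
    # By default, torch.load restores CUDA tensors onto the device they were
    # saved from (typically cuda:0), so every spawned replica would allocate
    # on that one GPU. Remapping with map_location avoids this.
    checkpoint = torch.load(
        f'{model_dir}/{filename}.pt',
        map_location='cpu')  # or map_location=f'cuda:{replica_id}'
    return checkpoint
```

If the checkpoint is remapped to CPU, each replica can then move the restored weights onto its own device with the usual .to(device) call after load_state_dict.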