Decent
Restructuring the code to support the DDP framework
Hi,
I tried to restructure your code to use DDP (DistributedDataParallel) for faster multi-GPU training. However, I ran into the following issue:
File "/data2/liurzh/Decent_parallel/train.py", line 243, in <module>
main()
File "/data2/liurzh/Decent_parallel/train.py", line 109, in main
model.optimize_parameters() # calculate loss functions, get gradients, update network weights
File "/data2/liurzh/Decent_parallel/models/decent_gan_model.py", line 126, in optimize_parameters
self.loss_D = self.compute_D_loss()
File "/data2/liurzh/Decent_parallel/models/decent_gan_model.py", line 162, in compute_D_loss
pred_fake = self.netD(fake)
File "/data3/liurzh/miniconda3/envs/dhc/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/data3/liurzh/miniconda3/envs/dhc/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
File "/data3/liurzh/miniconda3/envs/dhc/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1643, in forward
else self._run_ddp_forward(*inputs, **kwargs)
File "/data3/liurzh/miniconda3/envs/dhc/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1459, in _run_ddp_forward
return self.module(*inputs, **kwargs) # type: ignore[index]
File "/data3/liurzh/miniconda3/envs/dhc/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/data3/liurzh/miniconda3/envs/dhc/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
File "/data2/liurzh/Decent_parallel/models/networks.py", line 1486, in forward
return self.model(input)
File "/data3/liurzh/miniconda3/envs/dhc/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/data3/liurzh/miniconda3/envs/dhc/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
File "/data3/liurzh/miniconda3/envs/dhc/lib/python3.10/site-packages/torch/nn/modules/container.py", line 250, in forward
input = module(input)
File "/data3/liurzh/miniconda3/envs/dhc/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/data3/liurzh/miniconda3/envs/dhc/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
File "/data2/liurzh/Decent_parallel/models/networks.py", line 61, in forward
return F.conv2d(self.pad(inp), self.filt, stride=self.stride, groups=inp.shape[1])
(Triggered internally at /pytorch/torch/csrc/autograd/python_anomaly_mode.cpp:122.)
return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
[rank0]: Traceback (most recent call last):
[rank0]: File "/data2/liurzh/Decent_parallel/train.py", line 243, in <module>
[rank0]: main()
[rank0]: File "/data2/liurzh/Decent_parallel/train.py", line 109, in main
[rank0]: model.optimize_parameters() # calculate loss functions, get gradients, update network weights
[rank0]: File "/data2/liurzh/Decent_parallel/models/decent_gan_model.py", line 127, in optimize_parameters
[rank0]: self.loss_D.backward()
[rank0]: File "/data3/liurzh/miniconda3/envs/dhc/lib/python3.10/site-packages/torch/_tensor.py", line 626, in backward
[rank0]: torch.autograd.backward(
[rank0]: File "/data3/liurzh/miniconda3/envs/dhc/lib/python3.10/site-packages/torch/autograd/__init__.py", line 347, in backward
[rank0]: _engine_run_backward(
[rank0]: File "/data3/liurzh/miniconda3/envs/dhc/lib/python3.10/site-packages/torch/autograd/graph.py", line 823, in _engine_run_backward
[rank0]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
[rank0]: RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [256, 1, 3, 3]] is at version 3; expected version 2 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!
[rank0]:[W529 08:32:03.448938579 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
E0529 08:32:09.141000 1181402 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 0 (pid: 1186218) of binary: /data3/liurzh/miniconda3/envs/dhc/bin/python
The torch version is 2.6, the CUDA version is 12.4, and the Python version is 3.10. How can I solve this issue?
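For context on what the RuntimeError means, here is a minimal standalone reproduction of the same class of autograd error, independent of DDP and of the Decent code (this is only an illustration of the failure mode, not the actual operation in your model):

```python
import torch

# A tensor saved for the backward pass is modified in place afterwards,
# so its version counter no longer matches what autograd recorded.
x = torch.ones(3, requires_grad=True)
y = x.exp()      # exp() saves its output y for use in its backward pass
y.add_(1.0)      # in-place op bumps y's version counter (e.g. 0 -> 1)

try:
    y.sum().backward()
except RuntimeError as e:
    # Message contains "... modified by an inplace operation ..."
    print("RuntimeError:", e)
```

In DDP training this error is commonly triggered either by buffer broadcasting (DDP re-synchronizes registered buffers in place at the start of each forward, which can invalidate a buffer such as the `[256, 1, 3, 3]` antialiasing filter if it was saved for an earlier backward) — often worked around by constructing the wrapper with `DistributedDataParallel(model, broadcast_buffers=False)` — or by stepping one network's optimizer between another network's forward and backward in the GAN loop. Whether either applies here depends on the model code, so treat these as hypotheses to test rather than a confirmed diagnosis.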