STDC-Seg

Errors when training on a single machine with a single GPU

Open yang-stephen opened this issue 3 years ago • 3 comments

Hi, I would really like to know how to modify the code for training on a single machine with a single GPU. I tried running it directly on one GPU, and it fails at this point:

Traceback (most recent call last):
  File "train.py", line 429, in <module>
    train()
  File "train.py", line 292, in train
    out, out16, out32, detail8 = net(im)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 885, in forward
    inputs, kwargs = self.to_kwargs(inputs, kwargs, self.device_ids[0])
  File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 992, in to_kwargs
    inputs = self._recursive_to(inputs, device_id) if inputs else []
  File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 986, in _recursive_to
    res = to_map(inputs)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 977, in to_map
    return list(zip(*map(to_map, obj)))
  File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 955, in to_map
    if obj.device == torch.device("cuda", target_gpu):
RuntimeError: Device index must not be negative

After changing the default value of local_rank to 0, I get a different error:

Traceback (most recent call last):
  File "train.py", line 429, in <module>
    train()
  File "train.py", line 292, in train
    out, out16, out32, detail8 = net(im)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 886, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/STDC-Seg/models/model_stages.py", line 272, in forward
    feat_res2, feat_res4, feat_res8, feat_res16, feat_cp8, feat_cp16 = self.cp(x)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/STDC-Seg/models/model_stages.py", line 141, in forward
    avg = self.conv_avg(avg)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/STDC-Seg/models/model_stages.py", line 31, in forward
    x = self.bn(x)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/STDC-Seg/modules/bn.py", line 118, in forward
    return inplace_abn_sync(x, self.weight, self.bias, self.running_mean, self.running_var,
RuntimeError: Some elements marked as dirty during the forward method were not returned as output. The inputs that are modified inplace must all be outputs of the Function.

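I don't know how to fix this and would really like to find out. Alternatively, I would appreciate any pointers on how to modify train.py to remove distributed training. Many thanks!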
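For reference, the first error is consistent with a negative local_rank being passed to DistributedDataParallel as a device index. Below is a minimal sketch of the kind of change being asked about, not the repository's actual code: `prepare_model` is a hypothetical helper that falls back to GPU 0 (or CPU) when local_rank is negative and only wraps the network in DistributedDataParallel when a process group has actually been initialized.

```python
from typing import Tuple

import torch
import torch.distributed as dist
import torch.nn as nn


def prepare_model(net: nn.Module, local_rank: int) -> Tuple[nn.Module, torch.device]:
    """Hypothetical helper: place `net` on a device, wrapping in DDP only if needed."""
    use_cuda = torch.cuda.is_available()
    # A negative local_rank is treated here as "not started by a distributed launcher".
    index = local_rank if local_rank >= 0 else 0
    device = torch.device("cuda", index) if use_cuda else torch.device("cpu")
    net = net.to(device)

    # Wrap only when a process group exists; otherwise return the plain module.
    distributed = dist.is_available() and dist.is_initialized()
    if distributed and use_cuda:
        net = nn.parallel.DistributedDataParallel(
            net, device_ids=[index], output_device=index
        )
    return net, device


if __name__ == "__main__":
    # Stand-in network; in train.py this would be the segmentation model.
    model, device = prepare_model(nn.Conv2d(3, 8, kernel_size=3, padding=1), local_rank=-1)
    out = model(torch.randn(2, 3, 16, 16, device=device))
    print(type(model).__name__, tuple(out.shape))
```

Any distributed sampler or other torch.distributed calls in the training script would presumably need the same kind of guard.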

yang-stephen, Apr 04 '22 13:04

Hi, did you solve this in the end? Were you able to train successfully on a single machine with a single GPU?

Lee6384, Oct 16 '22 12:10

There is an answer in English that solves this problem.

Haodong-Lei-Ray, Nov 11 '22 04:11

RuntimeError: Some elements marked as dirty during the forward method were not returned as output. The inputs that are modified inplace must all be outputs of the Function. @yang-stephen @Lee6384

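The error above is raised because the wrong BatchNorm2d function is being used. I suggest using the official torch normalization layer instead, i.e. nn.BatchNorm2d (in model_stages.py).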
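As a rough illustration of that swap, here is a minimal sketch of a conv block built on the stock nn.BatchNorm2d. The class name, constructor arguments, and the explicit ReLU are assumptions for illustration and may not match the block defined in model_stages.py.

```python
import torch
import torch.nn as nn


class ConvBNReLU(nn.Module):
    """Illustrative Conv -> BatchNorm -> ReLU block using torch's own BatchNorm2d."""

    def __init__(self, in_chan, out_chan, ks=3, stride=1, padding=1):
        super().__init__()
        self.conv = nn.Conv2d(
            in_chan, out_chan, kernel_size=ks, stride=stride, padding=padding, bias=False
        )
        # Stock batch norm in place of the custom in-place/synchronized variant.
        self.bn = nn.BatchNorm2d(out_chan)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.conv(x)
        x = self.bn(x)
        return self.relu(x)


if __name__ == "__main__":
    block = ConvBNReLU(3, 16)
    y = block(torch.randn(2, 3, 32, 32))
    y.mean().backward()  # backward pass runs without the "marked as dirty" error
    print(tuple(y.shape))
```

If the original layer fuses an activation into the normalization step, the separate ReLU here may need adjusting to keep the same behavior.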

LingsiDS, Mar 09 '23 06:03