
S2ANet multi-GPU training fails

hust-lidelong opened this issue · 4 comments

Hello, when I run S2ANet, single-GPU training works fine, but multi-GPU training fails with:

```
Traceback (most recent call last):
  File "/home/lidelong/data/code/OBBDetection/./tools/train.py", line 162, in <module>
    main()
  File "/home/lidelong/data/code/OBBDetection/./tools/train.py", line 151, in main
    train_detector(
  File "/home/lidelong/data/code/OBBDetection/mmdet/apis/train.py", line 136, in train_detector
    runner.run(data_loaders, cfg.workflow, cfg.total_epochs)
  File "/home/lidelong/miniconda3/envs/obbdet/lib/python3.9/site-packages/mmcv/runner/epoch_based_runner.py", line 127, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/home/lidelong/miniconda3/envs/obbdet/lib/python3.9/site-packages/mmcv/runner/epoch_based_runner.py", line 50, in train
    self.run_iter(data_batch, train_mode=True, **kwargs)
  File "/home/lidelong/miniconda3/envs/obbdet/lib/python3.9/site-packages/mmcv/runner/epoch_based_runner.py", line 29, in run_iter
    outputs = self.model.train_step(data_batch, self.optimizer,
  File "/home/lidelong/miniconda3/envs/obbdet/lib/python3.9/site-packages/mmcv/parallel/distributed.py", line 42, in train_step
    and self.reducer._rebuild_buckets()):
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by (1) passing the keyword argument find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel; (2) making sure all forward function outputs participate in calculating loss. If you already have done the above two steps, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's forward function. Please include the loss function and the structure of the return value of forward of your module when reporting this issue (e.g. list, dict, iterable).
```

hust-lidelong · Dec 14 '21

Per the MMDet documentation (doc), you can try adding `find_unused_parameters = True` to your config file.
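To see why this flag helps, here is a minimal, self-contained sketch (not the actual S2ANet code): a toy module registers a parameter that never participates in `forward()`, which is exactly the situation DDP's reducer complains about; wrapping it with `find_unused_parameters=True` lets training proceed. The module and parameter names are made up for illustration, and a single-process gloo group stands in for a real multi-GPU launch.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Single-process "distributed" setup for illustration (gloo backend, world size 1).
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

class ToyHead(torch.nn.Module):
    """Hypothetical module mimicking the orphaned learnable parameter."""
    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(4, 2)
        # Registered but never touched in forward() -> "unused" for DDP.
        self.unused = torch.nn.Parameter(torch.zeros(3))

    def forward(self, x):
        return self.fc(x)

# Without find_unused_parameters=True, DDP raises the
# "Expected to have finished reduction ..." RuntimeError once it
# notices `unused` never receives a gradient.
model = DDP(ToyHead(), find_unused_parameters=True)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

for _ in range(2):  # the second iteration is where the bucket-rebuild check fires
    opt.zero_grad()
    loss = model(torch.randn(8, 4)).sum()
    loss.backward()
    opt.step()

print("training ran without the unused-parameter error")
dist.destroy_process_group()
```

The flag makes DDP traverse the autograd graph each iteration to detect which parameters went unused, which is also why it slows training down.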

liuyanyi · Dec 14 '21

> Per the MMDet documentation (doc), you can try adding `find_unused_parameters = True` to your config file.

Thanks for the reply. I tried adding `find_unused_parameters = True` and it works, but training becomes slower (as the MMDet docs also note: "but this will slow down the training speed"). Is there another way?

hust-lidelong · Dec 14 '21

@hust-lidelong This happens because or_pooling defines a set of learnable parameters that are never used. I have commented these parameters out in the latest commit, so distributed training should now work out of the box.

jbwang1997 · Dec 14 '21

> @hust-lidelong This happens because or_pooling defines a set of learnable parameters that are never used. I have commented these parameters out in the latest commit, so distributed training should now work out of the box.

Distributed training works now. Thumbs up!

hust-lidelong · Dec 14 '21