Fast-BEV

Errors occur during training

Open · AndrewJSong opened this issue 2 years ago · 4 comments

Training cannot get through a single epoch; the error shows up at a different batch each time.
Question 1: The seed defaults to 0. Does that mean it has no effect on data loading? If it were taking effect, the error should occur at the same point every run.
Question 2: With both the pkl files provided by the author and the pkl files I generated myself, the error appears within the first training epoch. With the mini dataset, training completes all 20 epochs.
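Regarding question 1: in the usual mmdet-style training setup, the sampler and every DataLoader worker are seeded from the configured seed, so with seed 0 the shuffling order should be reproducible across runs. The sketch below illustrates that conventional seeding pattern; it is a sketch of the common PyTorch/mmdet convention, not a quote from this repo's tools/train.py, and the helper name worker_init_fn and its arguments are assumptions. If the iteration order really is deterministic but the crash still moves between batches, the trigger is most likely external to the data (for example the kernel OOM killer) rather than a specific sample.

```python
import random

import numpy as np
import torch


def worker_init_fn(worker_id, num_workers, rank, seed):
    """Deterministically seed one DataLoader worker (common mmdet-style pattern).

    With a fixed `seed`, shuffling and per-worker augmentation randomness repeat
    across runs, so a truly data-dependent crash would recur at the same batch.
    """
    worker_seed = num_workers * rank + worker_id + seed
    np.random.seed(worker_seed)
    random.seed(worker_seed)
    torch.manual_seed(worker_seed)


# Hypothetical usage with some `dataset` object:
# from functools import partial
# from torch.utils.data import DataLoader
# init_fn = partial(worker_init_fn, num_workers=4, rank=0, seed=0)
# loader = DataLoader(dataset, batch_size=4, shuffle=True,
#                     num_workers=4, worker_init_fn=init_fn)
```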

Environment:
sys.platform: linux
Python: 3.8.8 (default, Feb 24 2021, 21:46:12) [GCC 7.3.0]
CUDA available: True
GPU 0,1,2,3: NVIDIA GeForce RTX 3090
CUDA_HOME: /usr/local/cuda
NVCC: Build cuda_11.1.TC455_06.29190527_0
GCC: gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
PyTorch: 1.8.1
PyTorch compiling details: PyTorch built with:

  • GCC 7.3
  • C++ Version: 201402
  • Intel(R) Math Kernel Library Version 2020.0.2 Product Build 20200624 for Intel(R) 64 architecture applications
  • Intel(R) MKL-DNN v1.7.0 (Git Hash 7aed236906b1f7a05c0917e5257a1af05e9ff683)
  • OpenMP 201511 (a.k.a. OpenMP 4.5)
  • NNPACK is enabled
  • CPU capability usage: AVX2
  • CUDA Runtime 11.1
  • CuDNN 8.0.5
  • Magma 2.5.2

TorchVision: 0.9.1
OpenCV: 4.7.0
MMCV: 1.4.0
MMCV Compiler: GCC 7.3
MMCV CUDA Compiler: 11.1
MMDetection: 2.14.0
MMSegmentation: 0.14.1
MMDetection3D: 0.16.0+69d67ff

The error message is as follows:

2023-03-29 05:03:13,493 - mmdet - INFO - Epoch [1][1860/4004] lr: 3.907e-04, eta: 3 days, 9:03:25, time: 3.341, data_time: 0.491, memory: 18153, positive_bag_loss: 1.4544, negative_bag_loss: 0.1518, loss: 1.6061, grad_norm: 1.5741
Traceback (most recent call last):
  File "tools/train.py", line 279, in <module>
    main()
  File "tools/train.py", line 268, in main
    train_model(
  File "/workspace/Fast-BEV/mmdet3d/apis/train.py", line 184, in train_model
    train_detector(
  File "/workspace/Fast-BEV/mmdet3d/apis/train.py", line 159, in train_detector
    runner.run(data_loaders, cfg.workflow)
  File "/opt/conda/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 127, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 50, in train
    self.run_iter(data_batch, train_mode=True, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 29, in run_iter
    outputs = self.model.train_step(data_batch, self.optimizer,
  File "/opt/conda/lib/python3.8/site-packages/mmcv/parallel/distributed.py", line 52, in train_step
    output = self.module.train_step(*inputs[0], **kwargs[0])
  File "/opt/conda/lib/python3.8/site-packages/mmdet/models/detectors/base.py", line 237, in train_step
    losses = self(**data)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/mmcv/runner/fp16_utils.py", line 98, in new_func
    return old_func(*args, **kwargs)
  File "/workspace/Fast-BEV/mmdet3d/models/detectors/fastbev.py", line 294, in forward
    return self.forward_train(img, img_metas, **kwargs)
  File "/workspace/Fast-BEV/mmdet3d/models/detectors/fastbev.py", line 301, in forward_train
    feature_bev, valids, features_2d = self.extract_feat(img, img_metas, "train")
  File "/workspace/Fast-BEV/mmdet3d/models/detectors/fastbev.py", line 123, in extract_feat
    x = self.backbone(
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/mmdet/models/backbones/resnet.py", line 642, in forward
    x = res_layer(x)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/container.py", line 119, in forward
    input = module(input)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/mmdet/models/backbones/resnet.py", line 89, in forward
    out = _inner_forward(x)
  File "/opt/conda/lib/python3.8/site-packages/mmdet/models/backbones/resnet.py", line 72, in _inner_forward
    out = self.conv1(x)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 399, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 395, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 6011) is killed by signal: Killed.
Killing subprocess 740
Killing subprocess 741
Killing subprocess 742
Killing subprocess 743
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 340, in <module>
    main()
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 326, in main
    sigkill_handler(signal.SIGTERM, None)  # not coming back
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 301, in sigkill_handler
    raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/opt/conda/bin/python', '-u', 'tools/train.py', '--local_rank=3', './configs/fastbev/exp/paper/fastbev_m0_r18_s256x704_v200x200x4_c192_d2_f4.py', '--work-dir=./work_dirs/my/exp/', '--launcher=pytorch', '--gpus', '4']' returned non-zero exit status 1.
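A DataLoader worker that dies with "killed by signal: Killed" (SIGKILL) has usually been reaped by the Linux OOM killer rather than crashed on its own, which would also explain why the failure lands on a different batch each run. One way to confirm this is to check the kernel log right after the failure. The snippet below is a minimal sketch, assuming dmesg is readable from inside the training environment; in a locked-down container the host's kernel log may have to be checked instead.

```python
import subprocess


def recent_oom_kills():
    """Return kernel-log lines that mention the OOM killer (needs dmesg access)."""
    try:
        out = subprocess.run(["dmesg", "-T"], capture_output=True,
                             text=True, check=True).stdout
    except (OSError, subprocess.CalledProcessError):
        return []  # no permission or dmesg unavailable; check the host instead
    keywords = ("out of memory", "killed process", "oom-killer")
    return [line for line in out.splitlines()
            if any(k in line.lower() for k in keywords)]


if __name__ == "__main__":
    for line in recent_oom_kills():
        print(line)
```

If entries about a killed python process show up with timestamps matching the crash, the bottleneck is host RAM or shared memory, not CUDA memory.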

AndrewJSong · Mar 29 '23 05:03

Hello, have you found a solution? I hit the same situation, roughly once per epoch. https://discuss.pytorch.org/t/died-with-signals-sigkill-9-when-in-first-epoch-the-program-is-killed/131704/1 describes pretty much the same problem, but it is hard to pin down exactly where; most likely something is running out of memory.
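If it is indeed host-memory pressure, the usual mitigations in an mmdet-style setup are fewer DataLoader workers, a smaller per-GPU batch, and, when training inside Docker, a larger shared-memory segment (for example docker run --shm-size=32g), since worker-to-main-process tensor transfer goes through /dev/shm. Below is a minimal, hypothetical config override written in the standard mmdet convention; the exact keys honored by the Fast-BEV config used in this thread are an assumption, so adjust to match the actual fastbev_m0_r18_s256x704_v200x200x4_c192_d2_f4.py settings.

```python
# Hypothetical override file, e.g. configs/fastbev/exp/paper/fastbev_m0_lowmem.py,
# layered on top of the config used in this issue. The field names follow the
# standard mmdet `data` dict convention and are not taken from this repo.
_base_ = ['./fastbev_m0_r18_s256x704_v200x200x4_c192_d2_f4.py']

data = dict(
    samples_per_gpu=2,   # smaller per-GPU batch lowers peak host and GPU memory
    workers_per_gpu=2,   # fewer workers, fewer resident copies of the loading pipeline
)
```

Watching free -h and df -h /dev/shm during an epoch shows whether system RAM or shared memory is the one climbing toward its limit before a worker disappears.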

chr10003566 · Nov 01 '23 09:11

Has anyone solved this problem?

Nepenthes-zlc · Jan 21 '24 15:01

Has anyone solved this problem?

evercherish · May 25 '24 08:05