Training ScanNet200 dataset Error

Open xiaotiancai899 opened this issue 2 years ago • 3 comments

While I was training on the ScanNet200 dataset, an error occurred at epoch 55 of 120.

Traceback (most recent call last):
  File "tools/train.py", line 332, in <module>
    main()
  File "tools/train.py", line 323, in main
    train(epoch, model, optimizer, scheduler, scaler, train_loader, cfg, logger, writer)
  File "tools/train.py", line 80, in train
    loss, log_vars = model(batch, return_loss=True, epoch=epoch - 1)  # could this epoch possibly become -1 or something like that???
  File "/home/clinton/anaconda3/envs/isbnet/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/mnt/d/student/Documents/software/wsl/isbnet/isbnet-master/isbnet-master/isbnet/model/isbnet.py", line 219, in forward
    return self.forward_train(**batch, epoch=epoch)
  File "/mnt/d/student/Documents/software/wsl/isbnet/isbnet-master/isbnet-master/isbnet/util/utils.py", line 172, in wrapper
    return func(*new_args, **new_kwargs)
  File "/mnt/d/student/Documents/software/wsl/isbnet/isbnet-master/isbnet-master/isbnet/model/isbnet.py", line 265, in forward_train
    feats, coords_float, voxel_coords, spatial_shape, batch_size, p2v_map
  File "/mnt/d/student/Documents/software/wsl/isbnet/isbnet-master/isbnet-master/isbnet/model/isbnet.py", line 632, in forward_backbone
    output = self.unet(output)
  File "/home/clinton/anaconda3/envs/isbnet/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/mnt/d/student/Documents/software/wsl/isbnet/isbnet-master/isbnet-master/isbnet/model/blocks.py", line 250, in forward
    output_decoder = self.u(output_decoder)
  File "/home/clinton/anaconda3/envs/isbnet/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/mnt/d/student/Documents/software/wsl/isbnet/isbnet-master/isbnet-master/isbnet/model/blocks.py", line 250, in forward
    output_decoder = self.u(output_decoder)
  File "/home/clinton/anaconda3/envs/isbnet/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/mnt/d/student/Documents/software/wsl/isbnet/isbnet-master/isbnet-master/isbnet/model/blocks.py", line 250, in forward
    output_decoder = self.u(output_decoder)
  File "/home/clinton/anaconda3/envs/isbnet/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/mnt/d/student/Documents/software/wsl/isbnet/isbnet-master/isbnet-master/isbnet/model/blocks.py", line 250, in forward
    output_decoder = self.u(output_decoder)
  File "/home/clinton/anaconda3/envs/isbnet/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/mnt/d/student/Documents/software/wsl/isbnet/isbnet-master/isbnet-master/isbnet/model/blocks.py", line 250, in forward
    output_decoder = self.u(output_decoder)
  File "/home/clinton/anaconda3/envs/isbnet/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/mnt/d/student/Documents/software/wsl/isbnet/isbnet-master/isbnet-master/isbnet/model/blocks.py", line 249, in forward
    output_decoder = self.conv(output)
  File "/home/clinton/anaconda3/envs/isbnet/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/clinton/anaconda3/envs/isbnet/lib/python3.7/site-packages/spconv/pytorch/modules.py", line 137, in forward
    input = module(input)
  File "/home/clinton/anaconda3/envs/isbnet/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/clinton/anaconda3/envs/isbnet/lib/python3.7/site-packages/spconv/pytorch/conv.py", line 404, in forward
    raise e
  File "/home/clinton/anaconda3/envs/isbnet/lib/python3.7/site-packages/spconv/pytorch/conv.py", line 395, in forward
    timer=input._timer)
  File "/home/clinton/anaconda3/envs/isbnet/lib/python3.7/site-packages/spconv/pytorch/ops.py", line 465, in get_indice_pairs_implicit_gemm
    stream_int=stream)
RuntimeError: /tmp/pip-build-env-a41g0q_q/overlay/lib/python3.7/site-packages/cumm/include/tensorview/cuda/launch.h(53) N > 0 assert faild. CUDA kernel launch blocks must be positive, but got N= 0

I used batch_size=1 and also froze all BatchNorm layers during training to avoid OOM. Any ideas about this? Thanks so much in advance!

xiaotiancai899 avatar Jun 10 '23 07:06 xiaotiancai899

@ngoductuanlhp

xiaotiancai899 avatar Jun 11 '23 05:06 xiaotiancai899

You could check similar issues on the original repo of spconv: https://github.com/traveller59/spconv/issues/406, https://github.com/mit-han-lab/bevfusion/issues/82.
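
If the failure only shows up at certain epochs, it may help to record which batch triggers it before digging further. The following is a minimal sketch, not part of ISBNet or spconv: `summarize_batch` and `run_step_logged` are hypothetical helper names, and it assumes the batch is a dict of sized entries (suggested by the `**batch` call in the traceback, but not verified against the data loader).

```python
def summarize_batch(batch):
    """Map each sized entry of the batch dict to its length."""
    summary = {}
    for key, value in batch.items():
        try:
            summary[key] = len(value)
        except TypeError:
            # Scalars (e.g. an integer batch size) have no len(); skip them.
            pass
    return summary


def run_step_logged(step_fn, batch, log=print):
    """Run one training step; on RuntimeError, log batch sizes and re-raise."""
    try:
        return step_fn(batch)
    except RuntimeError:
        # Hypothetical diagnostic: shows whether any entry is empty or unusually large.
        log(f"step failed on batch with sizes: {summarize_batch(batch)}")
        raise
```

In `tools/train.py` the model call could then be wrapped as `run_step_logged(lambda b: model(b, return_loss=True, epoch=epoch - 1), batch)`. The logged sizes would show whether the failing scene is empty or unusually large, which narrows down what to look for in the linked issues.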

Best.

ngoductuanlhp avatar Jun 11 '23 06:06 ngoductuanlhp

Neither of those solved my problem. Any other advice?

xiaotiancai899 avatar Jun 11 '23 06:06 xiaotiancai899