
SegFormerB1 CityScapes - CUDA error: an illegal memory access was encountered

adriengoleb opened this issue 2 years ago · 4 comments

Hello,

I want to run the training code as follows on 1 GPU: python tools/train.py local_configs/segformer/B1/segformer.b1.1024x1024.city.160k.py --gpus 1

First, I got the error AssertionError: Default process group is not initialized. Following other comments, I replaced all SyncBN with BN in the code.
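For anyone hitting the same AssertionError: the usual single-GPU workaround is to switch the normalization type in the config rather than editing the code everywhere. A minimal sketch, assuming the standard mmseg norm_cfg convention (check your own config file for the exact key):

```python
# Sketch of the SyncBN -> BN swap for single-GPU training.
# SyncBN needs an initialized distributed process group; plain BN does not.
norm_cfg = dict(type='SyncBN', requires_grad=True)  # original (multi-GPU)
norm_cfg = dict(type='BN', requires_grad=True)      # single-GPU replacement
```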

After that, I get this error:

Traceback (most recent call last):
  File "tools/train.py", line 166, in <module>
    main()
  File "tools/train.py", line 155, in main
    train_segmentor(
  File "/gpfs_new/scratch/users/agolebiewski/SegFormer/mmseg/apis/train.py", line 115, in train_segmentor
    runner.run(data_loaders, cfg.workflow)
  File "/data/users/agolebiewski/conda-envs/segformer/lib/python3.8/site-packages/mmcv/runner/iter_based_runner.py", line 131, in run
    iter_runner(iter_loaders[i], **kwargs)
  File "/data/users/agolebiewski/conda-envs/segformer/lib/python3.8/site-packages/mmcv/runner/iter_based_runner.py", line 60, in train
    outputs = self.model.train_step(data_batch, self.optimizer, **kwargs)
  File "/data/users/agolebiewski/conda-envs/segformer/lib/python3.8/site-packages/mmcv/parallel/data_parallel.py", line 67, in train_step
    return self.module.train_step(*inputs[0], **kwargs[0])
  File "/gpfs_new/scratch/users/agolebiewski/SegFormer/mmseg/models/segmentors/base.py", line 153, in train_step
    loss, log_vars = self._parse_losses(losses)
  File "/gpfs_new/scratch/users/agolebiewski/SegFormer/mmseg/models/segmentors/base.py", line 204, in _parse_losses
    log_vars[loss_name] = loss_value.item()
RuntimeError: CUDA error: an illegal memory access was encountered
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: an illegal memory access was encountered
Exception raised from create_event_internal at /opt/conda/conda-bld/pytorch_1595629395347/work/c10/cuda/CUDACachingAllocator.cpp:687 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x4d (0x7fa9dae8377d in /data/users/agolebiewski/conda-envs/segformer/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0xb5d (0x7fa9db0d3d9d in /data/users/agolebiewski/conda-envs/segformer/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::TensorImpl::release_resources() + 0x4d (0x7fa9dae6fb1d in /data/users/agolebiewski/conda-envs/segformer/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #3: <unknown function> + 0x53956b (0x7faa189e156b in /data/users/agolebiewski/conda-envs/segformer/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #21: __libc_start_main + 0xf5 (0x7faa45c2a555 in /lib64/libc.so.6)

Aborted

I get the same error when I launch tools/dist_train.sh local_configs/segformer/B1/segformer.b1.1024x1024.city.160k.py 1 via dist_train.sh.

I always end up with this CUDA error: an illegal memory access was encountered ...
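A generic debugging tip for this class of error: CUDA reports errors asynchronously, so the Python line in the traceback (loss_value.item()) is usually just a later synchronization point, not the operation that faulted. Re-running with CUDA_LAUNCH_BLOCKING=1 forces synchronous kernel launches so the trace points at the real culprit (much slower; debugging only):

```shell
# Force synchronous CUDA kernel launches so the stack trace points at the
# operation that actually faulted, instead of a later sync point like .item().
CUDA_LAUNCH_BLOCKING=1 python tools/train.py \
    local_configs/segformer/B1/segformer.b1.1024x1024.city.160k.py --gpus 1
```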

adriengoleb avatar May 16 '23 13:05 adriengoleb

Hi, same problem occurring here. Any progress?

hhhyyeee avatar Mar 15 '24 06:03 hhhyyeee

Since the failure surfaces at log_vars[loss_name] = loss_value.item(), I think it is caused by wrong label indices produced during the dataset augmentation process.

Check your labels:

  1. They should use the class id for each class, not RGB color values.
  2. Check the "ignore index": the cross-entropy loss defaults to ignore_index=-100, so make sure your dataset config files ignore the right index.

For a custom dataset that does not ignore the background, the padding step should use a dedicated index for padded pixels; otherwise the padding value can collide with a real class id and be included in the loss computation.

For example, a very common cause of this error on custom datasets is a wrong value for padding elements, solved by changing dict(type='Pad', size=crop_size, pad_val=0, seg_pad_val=255) to dict(type='Pad', size=crop_size, pad_val=0, seg_pad_val=-100).
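To make the label check above concrete, here is a minimal self-contained sketch (NumPy only; check_mask is a hypothetical helper, not part of mmseg) that flags label values the cross-entropy loss cannot handle:

```python
import numpy as np

def check_mask(mask: np.ndarray, num_classes: int, ignore_index: int = 255) -> list:
    """Return label values that would crash the CE loss: anything outside
    [0, num_classes) that is not the ignore_index."""
    return [int(v) for v in np.unique(mask)
            if v != ignore_index and not (0 <= v < num_classes)]

# Synthetic example: a mask that accidentally contains an RGB-like value (200).
mask = np.array([[0, 1, 255],
                 [2, 200, 1]], dtype=np.uint8)
bad = check_mask(mask, num_classes=19)  # Cityscapes uses 19 train classes
print(bad)  # [200]
```

Running a check like this over every mask before training catches out-of-range labels on the CPU, instead of as an opaque illegal memory access inside a CUDA kernel.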

zenhanghg-heng avatar Jun 14 '24 08:06 zenhanghg-heng

> (quoting the suggestion above)

Well it doesn't work

zkf85 avatar Jul 16 '24 06:07 zkf85

> (quoting the exchange above)

I encountered this problem when trying to train on COCO-Stuff 164k with a self-written dataset config and SegFormer config files. In my case, the problem turned out to be an inconsistency between the number of classes assigned in the SegFormer config file and the one defined in the mmseg dataset class Python file. For COCO-Stuff, I originally set this value to 172, which resulted in the errors mentioned above. After changing it to the value set in mmseg/datasets/CocoStuff.py, which is 182, training continues smoothly without the error.
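For reference, this is the same out-of-range-target failure mode as above: if decode_head.num_classes is smaller than the dataset's actual class count, any label >= num_classes indexes past the logits tensor inside the CUDA loss kernel. A minimal sanity check (the dict below is illustrative, mirroring the relevant piece of a SegFormer config, not the real config object):

```python
# Illustrative consistency check between the model head and the dataset.
DATASET_NUM_CLASSES = 182  # COCO-Stuff class count from mmseg/datasets

model_cfg = dict(decode_head=dict(num_classes=182))

assert model_cfg['decode_head']['num_classes'] == DATASET_NUM_CLASSES, (
    'decode_head.num_classes must match the dataset class count; otherwise '
    'targets >= num_classes corrupt memory inside the CUDA loss kernel'
)
```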

zkf85 avatar Jul 25 '24 02:07 zkf85