IndexError in DataLoader Worker Process with Custom Dataset

Open yulrio opened this issue 1 year ago • 10 comments

Hello,

I'm currently using the code from this repository with my own dataset, but I'm encountering an IndexError during the training phase. Below are the parameters and the traceback I received:

[ Fri Aug 16 10:18:36 2024 ] Parameters: {'work_dir': './work_dir/baseline_res18/', 'config': './configs/baseline.yaml', 'random_fix': True, 'device': '3', 'phase': 'train', 'save_interval': 5, 'random_seed': 0, 'eval_interval': 1, 'print_log': True, 'log_interval': 50, 'evaluate_tool': 'sclite', 'feeder': 'dataset.dataloader_video.BaseFeeder', 'dataset': 'QSLR2024', 'dataset_info': {'dataset_root': './dataset/QSLR2024', 'dict_path': './preprocess/QSLR2024/gloss_dict.npy', 'evaluation_dir': './evaluation/slr_eval', 'evaluation_prefix': 'QSLR2024-groundtruth'}, 'num_worker': 10, 'feeder_args': {'mode': 'test', 'datatype': 'video', 'num_gloss': -1, 'drop_ratio': 1.0, 'prefix': './dataset/QSLR2024', 'transform_mode': False}, 'model': 'slr_network.SLRModel', 'model_args': {'num_classes': 65, 'c2d_type': 'resnet18', 'conv_type': 2, 'use_bn': 1, 'share_classifier': False, 'weight_norm': False}, 'load_weights': None, 'load_checkpoints': None, 'decode_mode': 'beam', 'ignore_weights': [], 'batch_size': 2, 'test_batch_size': 8, 'loss_weights': {'SeqCTC': 1.0}, 'optimizer_args': {'optimizer': 'Adam', 'base_lr': 0.0001, 'step': [20, 35], 'learning_ratio': 1, 'weight_decay': 0.0001, 'start_epoch': 0, 'nesterov': False}, 'num_epoch': 30}

0%| | 0/162 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/raid/data/m33221012/VAC_CSLR_QSLR/main.py", line 213, in <module>
    processor.start()
  File "/raid/data/m33221012/VAC_CSLR_QSLR/main.py", line 44, in start
    seq_train(self.data_loader['train'], self.model, self.optimizer,
  File "/raid/data/m33221012/VAC_CSLR_QSLR/seq_scripts.py", line 18, in seq_train
    for batch_idx, data in enumerate(tqdm(loader)):
  File "/home/m33221012/miniconda3/envs/py31012/lib/python3.10/site-packages/tqdm/std.py", line 1181, in __iter__
    for obj in iterable:
  File "/home/m33221012/miniconda3/envs/py31012/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 630, in __next__
    data = self._next_data()
  File "/home/m33221012/miniconda3/envs/py31012/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1344, in _next_data
    return self._process_data(data)
  File "/home/m33221012/miniconda3/envs/py31012/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1370, in _process_data
    data.reraise()
  File "/home/m33221012/miniconda3/envs/py31012/lib/python3.10/site-packages/torch/_utils.py", line 706, in reraise
    raise exception
IndexError: Caught IndexError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/m33221012/miniconda3/envs/py31012/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 309, in _worker_loop
    data = fetcher.fetch(index)  # type: ignore[possibly-undefined]
  File "/home/m33221012/miniconda3/envs/py31012/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 52, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/m33221012/miniconda3/envs/py31012/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 52, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/raid/data/m33221012/VAC_CSLR_QSLR/dataset/dataloader_video.py", line 48, in __getitem__
    input_data, label = self.normalize(input_data, label)
  File "/raid/data/m33221012/VAC_CSLR_QSLR/dataset/dataloader_video.py", line 80, in normalize
    video, label = self.data_aug(video, label, file_id)
  File "/raid/data/m33221012/VAC_CSLR_QSLR/utils/video_augmentation.py", line 24, in __call__
    image = t(image)
  File "/raid/data/m33221012/VAC_CSLR_QSLR/utils/video_augmentation.py", line 119, in __call__
    if isinstance(clip[0], np.ndarray):
IndexError: list index out of range

It seems the issue occurs within the video_augmentation.py script when accessing clip[0]. I suspect it might be related to the data augmentation process or the input data structure.

Since I'm using my own dataset, could you please let me know what specific adjustments or preprocessing steps are necessary to ensure compatibility with your code? Additionally, is there a possibility that this error is related to hardware settings, such as GPU configuration or memory limitations?

Any advice on how to resolve this error and properly integrate my dataset would be greatly appreciated.

Thank you in advance for your help!

yulrio avatar Aug 16 '24 13:08 yulrio

Did you run the preprocessing script on your training data before training? I had this issue too when using a custom dataset, but after running the preprocessing script it worked fine.

RafaelAmauri avatar Sep 26 '24 00:09 RafaelAmauri

Thank you for replying to my question. May I know the configuration of the .yaml file? Thanks in advance.

yulrio avatar Sep 29 '24 09:09 yulrio

I am using the default values. I haven't changed any configs.

RafaelAmauri avatar Sep 29 '24 13:09 RafaelAmauri

I just ran the following command:

!python main.py --load-weights resnet18_baseline_dev_23.80_epoch25_model.pt --phase test --device 0

and got the following result:

Loading model finished.
Loading data
train 5671
Apply training transform.

train 5671
Apply testing transform.

dev 540
Apply testing transform.

test 629
Apply testing transform.

Loading data finished.
Working tree is dirty. Patch:
diff --git a/.gitignore b/.gitignore
old mode 100755
new mode 100644

[ Tue Oct  8 22:35:41 2024 ] Model: slr_network.SLRModel.
[ Tue Oct  8 22:35:42 2024 ] Weights: /content/drive/MyDrive/MyResearch/pretrain/resnet18_baseline_dev_23.80_epoch25_model.pt.
100% 68/68 [1:09:24<00:00, 61.24s/it]
/content/drive/MyDrive/MyResearch/VAC_CSLR_ORI_OSL
preprocess.sh ./work_dir/baseline_res18/output-hypothesis-dev-conv.ctm ./work_dir/baseline_res18/tmp.ctm ./work_dir/baseline_res18/tmp2.ctm
Tue Oct 8 11:45:07 PM UTC 2024
Preprocess Finished.
Unexpected error: <class 'AttributeError'>
[ Tue Oct  8 23:45:07 2024 ] Epoch 6667, dev 100.00%
100% 79/79 [1:15:47<00:00, 57.56s/it]
/content/drive/MyDrive/MyResearch/VAC_CSLR_ORI_OSL
preprocess.sh ./work_dir/baseline_res18/output-hypothesis-test-conv.ctm ./work_dir/baseline_res18/tmp.ctm ./work_dir/baseline_res18/tmp2.ctm
Wed Oct 9 01:00:55 AM UTC 2024
Preprocess Finished.
Unexpected error: <class 'AttributeError'>
[ Wed Oct  9 01:00:55 2024 ] Epoch 6667, test 100.00%
[ Wed Oct  9 01:00:55 2024 ] Evaluation Done.

Can you explain why the error Unexpected error: <class 'AttributeError'> occurred and which part of the code needs to be corrected?

Also, why did I get 100% for both dev and test?

Thanks in advance!

Onestringlab avatar Oct 09 '24 01:10 Onestringlab

  File "/raid/data/m33221012/VAC_CSLR_QSLR/dataset/dataloader_video.py", line 48, in __getitem__
    input_data, label = self.normalize(input_data, label)
  File "/raid/data/m33221012/VAC_CSLR_QSLR/dataset/dataloader_video.py", line 80, in normalize
    video, label = self.data_aug(video, label, file_id)
  File "/raid/data/m33221012/VAC_CSLR_QSLR/utils/video_augmentation.py", line 24, in __call__
    image = t(image)
  File "/raid/data/m33221012/VAC_CSLR_QSLR/utils/video_augmentation.py", line 119, in __call__
    if isinstance(clip[0], np.ndarray):
IndexError: list index out of range

Just in case anyone else runs into this: this error happens when the dataloader fails to load any frames from the dataset, so the clip handed to the augmentation step is an empty list and clip[0] fails. I hit it again because my dataset was laid out as dataset/features/{train,test,dev}. I had forgotten to add the 'fullFrame-256x256px' folder right after features, so the dataloader couldn't find the train/test/dev folders. It is hard-coded to look for a fullFrame-256x256px folder, and when it couldn't find one, nothing was loaded.

In other words, make sure the structure inside your custom dataset is identical to the one in phoenix2014. Any difference could break the training script.
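
A quick way to catch this before training is to check the layout on disk. The following is a minimal sketch, assuming the layout described above (features/fullFrame-256x256px/ followed by train, dev and test) with image frames somewhere below each split folder. The helper name `find_empty_splits` and the list of frame extensions are illustrative assumptions, not part of the repository.

```python
from pathlib import Path

# Assumption: frames are stored as image files with one of these extensions.
FRAME_EXTENSIONS = {".png", ".jpg", ".jpeg"}


def find_empty_splits(dataset_root, splits=("train", "dev", "test")):
    """Return {split: reason} for each split that is missing or has no frames.

    Hypothetical helper, not part of VAC_CSLR. It only inspects the folder
    layout; it does not read the annotation files.
    """
    base = Path(dataset_root) / "features" / "fullFrame-256x256px"
    problems = {}
    for split in splits:
        split_dir = base / split
        if not split_dir.is_dir():
            problems[split] = f"missing folder: {split_dir}"
            continue
        # Search recursively, since frames usually sit in per-video sub-folders.
        has_frames = any(
            p.is_file() and p.suffix.lower() in FRAME_EXTENSIONS
            for p in split_dir.rglob("*")
        )
        if not has_frames:
            problems[split] = f"no image frames under: {split_dir}"
    return problems


# Example usage (the path is the dataset_root from the parameters in this thread):
#     for split, reason in find_empty_splits("./dataset/QSLR2024").items():
#         print(f"{split}: {reason}")
```

An empty result only means that the folders exist and contain images. The file names listed in the annotation files still have to match what is on disk.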

RafaelAmauri avatar Oct 30 '24 04:10 RafaelAmauri

I don't know how to fix the AttributeError, but a 100% WER on both the dev and test splits could mean the evaluation is using the wrong ground-truth file.

Have you checked your configs/{your_dataset_name}.yaml?

This config file has an 'evaluation_dir' key and an 'evaluation_prefix' key that point to where the ground-truth files are. The ground truths are the .stm files generated by the preprocessing step, which are stored in ./preprocess/{your_dataset_name}/.

The preprocessing tool generates these .stm files automatically, but it doesn't move them to ./evaluation/slr_eval/. You have to move them there yourself.

After you run the preprocessing step, you should see a new folder inside the preprocess folder, named after your dataset. That is where you will find the ground-truth .stm files.
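
The manual step above can be sketched as a few lines of Python. This is only a sketch under assumptions: the helper name `copy_ground_truth` is hypothetical, it copies the files instead of moving them so the originals stay in place, and it takes every .stm file it finds. Whether the file names match what the evaluation expects (for example, the 'evaluation_prefix' from the config) still has to be checked by hand.

```python
import shutil
from pathlib import Path


def copy_ground_truth(preprocess_dir, evaluation_dir):
    """Copy every .stm file from preprocess_dir into evaluation_dir.

    Hypothetical helper, not part of the repository.
    Returns the names of the copied files.
    """
    src = Path(preprocess_dir)
    dst = Path(evaluation_dir)
    stm_files = sorted(src.glob("*.stm"))
    if not stm_files:
        raise FileNotFoundError(
            f"no .stm files in {src}; run the preprocessing step first"
        )
    dst.mkdir(parents=True, exist_ok=True)
    for stm in stm_files:
        # copy2 keeps timestamps, which helps to spot stale ground-truth files.
        shutil.copy2(stm, dst / stm.name)
    return [stm.name for stm in stm_files]


# Example usage (paths follow the config shown in this thread):
#     print(copy_ground_truth("./preprocess/QSLR2024", "./evaluation/slr_eval"))
```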

Good luck!

Edit: This comment originally had wrong information about the .stm files. I replaced it with the correct information.

RafaelAmauri avatar Oct 30 '24 04:10 RafaelAmauri

Thank you for the answer.

Could you let me know which version of PyTorch you used for these experiments?

Thanks again!

Onestringlab avatar Oct 31 '24 06:10 Onestringlab

I'm using Python 3.8.10 and PyTorch 1.13.1 inside a Docker container.

RafaelAmauri avatar Nov 01 '24 01:11 RafaelAmauri

Thanks for the suggestion. I did what you said and replaced the .stm file in the evaluation folder with the one generated by preprocess.py, but I still get 100% WER on dev. Could you please help me with this?

Thanks a lot!

BLOOM0-0 avatar Mar 28 '25 07:03 BLOOM0-0

Sure, I can help you out.

First off, did you follow all the preprocessing instructions in the README? Using this repository should be as easy as copying and pasting the steps from the README.

Is your only problem that you're getting 100% WER on the dev set? Besides that, is your training loop running normally? By "normally" I mean that the loss is going down, there aren't any warnings or errors, etc.

What about the test set? Are you getting 100% WER on it too?

RafaelAmauri avatar Mar 28 '25 13:03 RafaelAmauri