[EfficientDet/PyTorch] TypeError: new(): invalid data type 'str' when training EfficientDet on Waymo dataset
Related to EfficientDet/PyTorch
Describe the bug
When I try to reproduce the EfficientDet training result on the Waymo dataset as described in the README (https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/Detection/Efficientdet), training fails with "TypeError: new(): invalid data type 'str'" after the Waymo dataset has been loaded and training starts.
To Reproduce
Steps to reproduce the behavior:
- Clone 'https://github.com/NVIDIA/DeepLearningExamples' and cd into 'DeepLearningExamples/PyTorch/Detection/Efficientdet'
- Run 'waymo_tool/waymo_data_converter.py' to download the Waymo data and convert it into COCO format
- Set the dataset paths following 'scripts/waymo/train_waymo_AMP_8xA100-80G.sh'
- Launch './distributed_train.sh 8 /datasets/Waymo_JoC --model efficientdet_d0 -b 8 --amp --lr 0.2 --sync-bn --opt fusedmomentum --warmup-epochs 1 --output Chong_Results/EfficientDet-D0/Waymo/Dense_PT2106_V100/FP16_24E_8N --worker 8 --fill-color mean --model-ema --model-ema-decay 0.999 --eval-after 24 --epochs 24 --save-checkpoint-interval 1 --smoothing 0.0 --waymo --remove-weights class_net box_net anchor --input_size 1536 --num_classes 3 --resume --freeze-layers backbone --waymo-train /datasets/Waymo_JoC/waymo_coco_format_train/images --waymo-val /datasets/Waymo_JoC/waymo_coco_format_val/images --waymo-val-annotation /datasets/Waymo_JoC/waymo_coco_format_val/annotations/annotations.json --waymo-train-annotation /datasets/Waymo_JoC/waymo_coco_format_train/annotations/annotations.json'
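One possible first check on the converted data from the steps above is sketched here. It assumes, without confirmation from the log, that the 'str' in the error comes from a string-typed value in the COCO-format annotation JSON reaching a tensor constructor. The helper names (`find_string_fields`, `check_file`) and the list of fields to inspect are hypothetical and not part of the EfficientDet repository.

```python
import json
from collections import Counter

# Hypothetical diagnostic, not part of the EfficientDet repository.
# Assumption: these COCO fields are expected to hold numbers, not strings.
NUMERIC_FIELDS = {
    "images": ("id", "height", "width"),
    "annotations": ("id", "image_id", "category_id"),
    "categories": ("id",),
}


def find_string_fields(coco):
    """Count string-typed values in fields that are expected to be numeric."""
    hits = Counter()
    for section, fields in NUMERIC_FIELDS.items():
        for entry in coco.get(section, []):
            for field in fields:
                if isinstance(entry.get(field), str):
                    hits[f"{section}.{field}"] += 1
    # Bounding boxes are lists of numbers; flag any box containing a string.
    for ann in coco.get("annotations", []):
        if any(isinstance(v, str) for v in ann.get("bbox", [])):
            hits["annotations.bbox"] += 1
    return dict(hits)


def check_file(path):
    """Load a COCO-format annotation file and report string-typed fields."""
    with open(path) as f:
        return find_string_fields(json.load(f))
```

Calling `check_file` on the path passed to `--waymo-train-annotation` (and likewise for `--waymo-val-annotation`) returns a dict of offending fields and their counts. An empty dict would suggest the string originates somewhere other than these fields.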
Expected behavior
EfficientDet training on the Waymo dataset should run to completion without errors.
Environment
- Container version: pytorch:21.06-py3
- GPUs in the system: 8x Tesla A100-80GB
- CUDA version: 11.4
- CUDA driver version: 470.82.01
Log output from the training run:
Added key: store_based_barrier_key:1 to store for rank: 6
Added key: store_based_barrier_key:1 to store for rank: 5
Added key: store_based_barrier_key:1 to store for rank: 3
Added key: store_based_barrier_key:1 to store for rank: 2
Added key: store_based_barrier_key:1 to store for rank: 7
Added key: store_based_barrier_key:1 to store for rank: 4
Added key: store_based_barrier_key:1 to store for rank: 1
Added key: store_based_barrier_key:1 to store for rank: 0
Rank 0: Completed store-based barrier for 8 nodes.
Rank 6: Completed store-based barrier for 8 nodes.
Rank 5: Completed store-based barrier for 8 nodes.
Rank 3: Completed store-based barrier for 8 nodes.
Rank 2: Completed store-based barrier for 8 nodes.
Rank 7: Completed store-based barrier for 8 nodes.
Rank 4: Completed store-based barrier for 8 nodes.
Rank 1: Completed store-based barrier for 8 nodes.
Training in distributed mode with multiple processes, 1 GPU per process. Process 2, total 8.
Training in distributed mode with multiple processes, 1 GPU per process. Process 0, total 8.
Training in distributed mode with multiple processes, 1 GPU per process. Process 4, total 8.
Training in distributed mode with multiple processes, 1 GPU per process. Process 1, total 8.
Training in distributed mode with multiple processes, 1 GPU per process. Process 5, total 8.
Training in distributed mode with multiple processes, 1 GPU per process. Process 6, total 8.
Training in distributed mode with multiple processes, 1 GPU per process. Process 3, total 8.
Training in distributed mode with multiple processes, 1 GPU per process. Process 7, total 8.
model does not have attribute module...
model does not have attribute module...
model does not have attribute module...
model does not have attribute module...
model does not have attribute module...
model does not have attribute module...
model does not have attribute module...
model does not have attribute module...
Restoring model state from stete_dict ...
Loaded state_dict from checkpoint './backbone_checkpoints/efficientdet_backbone_efficientnet_b0_pyt_amp_ckpt_21.06.0.pth'
Restoring model state from stete_dict ...
Loaded state_dict from checkpoint './backbone_checkpoints/efficientdet_backbone_efficientnet_b0_pyt_amp_ckpt_21.06.0.pth'
Restoring model state from stete_dict ...
Loaded state_dict from checkpoint './backbone_checkpoints/efficientdet_backbone_efficientnet_b0_pyt_amp_ckpt_21.06.0.pth'
Restoring model state from stete_dict ...
Loaded state_dict from checkpoint './backbone_checkpoints/efficientdet_backbone_efficientnet_b0_pyt_amp_ckpt_21.06.0.pth'
Restoring model state from stete_dict ...
Loaded state_dict from checkpoint './backbone_checkpoints/efficientdet_backbone_efficientnet_b0_pyt_amp_ckpt_21.06.0.pth'
Restoring model state from stete_dict ...
Loaded state_dict from checkpoint './backbone_checkpoints/efficientdet_backbone_efficientnet_b0_pyt_amp_ckpt_21.06.0.pth'
Restoring model state from stete_dict ...
Loaded state_dict from checkpoint './backbone_checkpoints/efficientdet_backbone_efficientnet_b0_pyt_amp_ckpt_21.06.0.pth'
Restoring model state from stete_dict ...
Loaded state_dict from checkpoint './backbone_checkpoints/efficientdet_backbone_efficientnet_b0_pyt_amp_ckpt_21.06.0.pth'
Input size to be passed to dataloaders: 1536
Image size used in model: 1536
DLL 2022-04-06 06:03:54.651781 - PARAMETER model_name : efficientdet_d0 param_count : 3826868
Input size to be passed to dataloaders: 1536
Image size used in model: 1536
Input size to be passed to dataloaders: 1536
Image size used in model: 1536
Input size to be passed to dataloaders: 1536
Image size used in model: 1536
Input size to be passed to dataloaders: 1536
Image size used in model: 1536
Input size to be passed to dataloaders: 1536
Image size used in model: 1536
Input size to be passed to dataloaders: 1536
Image size used in model: 1536
Converted model to use Synchronized BatchNorm. WARNING: You may have issues if using zero initialized BN layers (enabled by default for ResNets) while sync-bn enabled.
Chong_Results/EfficientDet-D0/Waymo/Dense_PT2106_V100/FP16_24E_8N/train does not exist to load checkpoint
Chong_Results/EfficientDet-D0/Waymo/Dense_PT2106_V100/FP16_24E_8N/train does not exist to load checkpoint
Chong_Results/EfficientDet-D0/Waymo/Dense_PT2106_V100/FP16_24E_8N/train does not exist to load checkpoint
Chong_Results/EfficientDet-D0/Waymo/Dense_PT2106_V100/FP16_24E_8N/train does not exist to load checkpoint
Input size to be passed to dataloaders: 1536
Image size used in model: 1536
Chong_Results/EfficientDet-D0/Waymo/Dense_PT2106_V100/FP16_24E_8N/train does not exist to load checkpoint
Chong_Results/EfficientDet-D0/Waymo/Dense_PT2106_V100/FP16_24E_8N/train does not exist to load checkpoint
Chong_Results/EfficientDet-D0/Waymo/Dense_PT2106_V100/FP16_24E_8N/train does not exist to load checkpoint
Chong_Results/EfficientDet-D0/Waymo/Dense_PT2106_V100/FP16_24E_8N/train does not exist to load checkpoint
Using torch DistributedDataParallel. Install NVIDIA Apex for Apex DDP.
DLL 2022-04-06 06:03:56.451268 - PARAMETER Scheduled_epochs : 34
loading annotations into memory...
loading annotations into memory...
loading annotations into memory...
loading annotations into memory...
loading annotations into memory...
loading annotations into memory...
loading annotations into memory...
loading annotations into memory...
Done (t=69.44s)
creating index...
Done (t=71.93s)
creating index...
Done (t=72.49s)
creating index...
Done (t=72.71s)
creating index...
Done (t=72.77s)
creating index...
Done (t=73.04s)
creating index...
Done (t=73.05s)
creating index...
Done (t=73.08s)
creating index...
index created!
index created!
index created!
index created!
index created!
index created!
index created!
index created!
loading annotations into memory...
loading annotations into memory...
loading annotations into memory...
loading annotations into memory...
loading annotations into memory...
loading annotations into memory...
loading annotations into memory...
loading annotations into memory...
Done (t=22.79s)
creating index...
index created!
Done (t=23.14s)
creating index...
Done (t=22.88s)
creating index...
Done (t=23.24s)
creating index...
Done (t=23.59s)
creating index...
Done (t=23.13s)
creating index...
Done (t=23.37s)
creating index...
Done (t=23.39s)
creating index...
index created!
index created!
index created!
index created!
index created!
index created!
index created!
Traceback (most recent call last):
File "train.py", line 635, in