KeyError: 'loss_cls'
As the tutorial suggested, I started training the soft_teacher_faster_rcnn_r50_caffe_fpn_coco_180k.py model with the default settings on the COCO dataset.
The script that I used for training: for FOLD in 1 2 3 4 5; do bash tools/dist_train_partially.sh semi ${FOLD} 10 1; done
After 10m 33s of training, this is the error I received. Can anyone help me debug it, please?
2021-11-09 22:53:41,757 - mmdet.ssod - INFO - Iter [150/180000] lr: 2.987e-03, eta: 2 days, 2:53:22, time: 0.962, data_time: 0.038, memory: 6776, ema_momentum: 0.9933, sup_loss_rpn_cls: 0.3320, sup_loss_rpn_bbox: 0.1104, sup_loss_cls: 0.5242, sup_acc: 94.4134, sup_loss_bbox: 0.2345, unsup_loss_rpn_cls: 0.1124, unsup_loss_rpn_bbox: 0.0000, unsup_loss_cls: 0.0598, unsup_acc: 100.0000, unsup_loss_bbox: 0.0000, loss: 1.3733
2021-11-09 22:54:27,720 - mmdet.ssod - INFO - Iter [200/180000] lr: 3.986e-03, eta: 2 days, 1:38:03, time: 0.919, data_time: 0.037, memory: 6776, ema_momentum: 0.9950, sup_loss_rpn_cls: 0.2594, sup_loss_rpn_bbox: 0.0862, sup_loss_cls: 0.5218, sup_acc: 94.4144, sup_loss_bbox: 0.2317, unsup_loss_rpn_cls: 0.1035, unsup_loss_rpn_bbox: 0.0000, unsup_loss_cls: 0.0579, unsup_acc: 100.0000, unsup_loss_bbox: 0.0000, loss: 1.2605
2021-11-09 22:55:17,469 - mmdet.ssod - INFO - Iter [250/180000] lr: 4.985e-03, eta: 2 days, 1:37:55, time: 0.995, data_time: 0.037, memory: 6776, ema_momentum: 0.9960, sup_loss_rpn_cls: 0.2289, sup_loss_rpn_bbox: 0.0875, sup_loss_cls: 0.5456, sup_acc: 94.2955, sup_loss_bbox: 0.2385, unsup_loss_rpn_cls: 0.0753, unsup_loss_rpn_bbox: 0.0000, unsup_loss_cls: 0.0549, unsup_acc: 100.0000, unsup_loss_bbox: 0.0000, loss: 1.2307
2021-11-09 22:55:58,398 - mmdet.ssod - INFO - Iter [300/180000] lr: 5.984e-03, eta: 2 days, 0:09:31, time: 0.819, data_time: 0.036, memory: 6776, ema_momentum: 0.9967, sup_loss_rpn_cls: 0.2532, sup_loss_rpn_bbox: 0.0975, sup_loss_cls: 0.5785, sup_acc: 93.4809, sup_loss_bbox: 0.2680, unsup_loss_rpn_cls: 0.0862, unsup_loss_rpn_bbox: 0.0000, unsup_loss_cls: 0.0669, unsup_acc: 100.0000, unsup_loss_bbox: 0.0000, loss: 1.3503
2021-11-09 22:56:44,082 - mmdet.ssod - INFO - Iter [350/180000] lr: 6.983e-03, eta: 1 day, 23:46:51, time: 0.914, data_time: 0.035, memory: 6776, ema_momentum: 0.9971, sup_loss_rpn_cls: 0.2057, sup_loss_rpn_bbox: 0.0756, sup_loss_cls: 0.5141, sup_acc: 94.4669, sup_loss_bbox: 0.2291, unsup_loss_rpn_cls: 0.0679, unsup_loss_rpn_bbox: 0.0000, unsup_loss_cls: 0.0541, unsup_acc: 99.9980, unsup_loss_bbox: 0.0000, loss: 1.1465
2021-11-09 22:57:32,200 - mmdet.ssod - INFO - Iter [400/180000] lr: 7.982e-03, eta: 1 day, 23:47:52, time: 0.962, data_time: 0.035, memory: 6776, ema_momentum: 0.9975, sup_loss_rpn_cls: 0.2563, sup_loss_rpn_bbox: 0.1113, sup_loss_cls: 0.5627, sup_acc: 93.1599, sup_loss_bbox: 0.2776, unsup_loss_rpn_cls: 0.0925, unsup_loss_rpn_bbox: 0.0000, unsup_loss_cls: 0.0738, unsup_acc: 99.9895, unsup_loss_bbox: 0.0000, loss: 1.3742
2021-11-09 22:58:20,723 - mmdet.ssod - INFO - Iter [450/180000] lr: 8.981e-03, eta: 1 day, 23:51:11, time: 0.970, data_time: 0.036, memory: 6776, ema_momentum: 0.9978, sup_loss_rpn_cls: 0.2339, sup_loss_rpn_bbox: 0.1020, sup_loss_cls: 0.5518, sup_acc: 94.3613, sup_loss_bbox: 0.2326, unsup_loss_rpn_cls: 0.0830, unsup_loss_rpn_bbox: 0.0000, unsup_loss_cls: 0.0477, unsup_acc: 100.0000, unsup_loss_bbox: 0.0000, loss: 1.2511
2021-11-09 22:59:08,703 - mmdet.ssod - INFO - Iter [500/180000] lr: 9.980e-03, eta: 1 day, 23:50:25, time: 0.960, data_time: 0.035, memory: 6776, ema_momentum: 0.9980, sup_loss_rpn_cls: 0.2801, sup_loss_rpn_bbox: 0.1220, sup_loss_cls: 0.4242, sup_acc: 95.6570, sup_loss_bbox: 0.1740, unsup_loss_rpn_cls: 0.0971, unsup_loss_rpn_bbox: 0.0000, unsup_loss_cls: 0.0479, unsup_acc: 100.0000, unsup_loss_bbox: 0.0000, loss: 1.1452
Traceback (most recent call last):
File "./train.py", line 198, in
I had a similar problem. You can try using a bigger batch size via samples_per_gpu in the config file. You can also try the suggestions from #69.
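For context, samples_per_gpu lives in the data section of the config. The snippet below is only a rough sketch assuming the usual mmdetection-style layout; the values shown are placeholders, not the ones from this repository:

# Rough sketch of the data section in an mmdetection-style config
# (assumed layout, placeholder values).
data = dict(
    samples_per_gpu=4,   # batch size per GPU; try increasing this
    workers_per_gpu=2,   # dataloader workers per GPU
    # train=..., val=..., test=... stay as in the original config
)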
Thank you for responding. I will give it a try. Do you know how to specify the number and IDs of GPUs during training? It looks like if I set <GPU_NUM> to 1, it automatically defaults to GPU 0.
Hello. The training script tools/dist_train.sh launches multiple distributed training processes via torch.distributed.launch. In this script, the parameter nproc_per_node is the number of processes per node; it must be less than or equal to the number of GPUs on the current system, and each process operates on a single GPU, from GPU 0 to GPU (nproc_per_node - 1).
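To illustrate that mapping, here is a minimal sketch of how each launched process typically binds to its GPU. Depending on the PyTorch version, the local rank is delivered either as a --local_rank argument or as the LOCAL_RANK environment variable; the env-var form is assumed below:

import os
import torch

# torch.distributed.launch starts nproc_per_node copies of the script;
# each copy reads its local rank and pins itself to that GPU (0 .. nproc_per_node - 1).
local_rank = int(os.environ.get("LOCAL_RANK", 0))
torch.cuda.set_device(local_rank)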
If you want to change the default GPU for the first process, you can add --node_rank=<ANOTHER_GPU_NUMBER> in tools/dist_train.sh. Another way, before your code starts training, is to call:
torch.cuda.set_device(YOUR_GPU_ID)
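For illustration only (the GPU index below is a made-up example), torch.cuda.set_device accepts either an integer index or a device string, and subsequent .cuda() calls then default to that device:

import torch

gpu_id = 1                     # hypothetical GPU index for illustration
torch.cuda.set_device(gpu_id)  # equivalently: torch.cuda.set_device('cuda:1')

x = torch.zeros(1).cuda()      # lands on the selected device by default
print(x.device)                # cuda:1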