RTS training crash at the 35th epoch
Hi, I trained the RTS code on GOT-10k with a single A100 GPU. The first 34 epochs ran fine and produced their checkpoint files, but an error occurred while the 35th epoch was running. The error details are below:
Restarting training from last epoch ...
[train: 35, 50 / 1000] FPS: 4.0 (7.9) , Loss/total: 3.41968 , Loss/segm: 1.38611 , Stats/acc: 0.81421 , Stats/clf_acc: 0.82711 , Stats/clf_peak_dist: 1.40691 , Loss/target_clf: 0.28692 , Loss/test_init_clf: 0.56498 , Loss/test_iter_clf: 1.18168 , ClfTrain/test_loss: 0.00287 , ClfTrain/test_init_loss: 0.00565 , ClfTrain/test_iter_loss: 0.00295
[train: 35, 100 / 1000] FPS: 5.4 (8.5) , Loss/total: 3.44221 , Loss/segm: 1.39679 , Stats/acc: 0.81277 , Stats/clf_acc: 0.82222 , Stats/clf_peak_dist: 1.45234 , Loss/target_clf: 0.28915 , Loss/test_init_clf: 0.56807 , Loss/test_iter_clf: 1.18819 , ClfTrain/test_loss: 0.00289 , ClfTrain/test_init_loss: 0.00568 , ClfTrain/test_iter_loss: 0.00297
[train: 35, 150 / 1000] FPS: 6.1 (8.0) , Loss/total: 3.43003 , Loss/segm: 1.38178 , Stats/acc: 0.81604 , Stats/clf_acc: 0.82044 , Stats/clf_peak_dist: 1.46284 , Loss/target_clf: 0.28963 , Loss/test_init_clf: 0.56793 , Loss/test_iter_clf: 1.19069 , ClfTrain/test_loss: 0.00290 , ClfTrain/test_init_loss: 0.00568 , ClfTrain/test_iter_loss: 0.00298
[train: 35, 200 / 1000] FPS: 6.5 (8.4) , Loss/total: 3.46724 , Loss/segm: 1.40390 , Stats/acc: 0.81261 , Stats/clf_acc: 0.81378 , Stats/clf_peak_dist: 1.49966 , Loss/target_clf: 0.29318 , Loss/test_init_clf: 0.56535 , Loss/test_iter_clf: 1.20482 , ClfTrain/test_loss: 0.00293 , ClfTrain/test_init_loss: 0.00565 , ClfTrain/test_iter_loss: 0.00301
[train: 35, 250 / 1000] FPS: 6.8 (8.3) , Loss/total: 3.48664 , Loss/segm: 1.41969 , Stats/acc: 0.81263 , Stats/clf_acc: 0.81333 , Stats/clf_peak_dist: 1.50668 , Loss/target_clf: 0.29351 , Loss/test_init_clf: 0.56558 , Loss/test_iter_clf: 1.20786 , ClfTrain/test_loss: 0.00294 , ClfTrain/test_init_loss: 0.00566 , ClfTrain/test_iter_loss: 0.00302
[train: 35, 300 / 1000] FPS: 7.0 (8.9) , Loss/total: 3.46265 , Loss/segm: 1.40664 , Stats/acc: 0.81405 , Stats/clf_acc: 0.81570 , Stats/clf_peak_dist: 1.48712 , Loss/target_clf: 0.29100 , Loss/test_init_clf: 0.56539 , Loss/test_iter_clf: 1.19962 , ClfTrain/test_loss: 0.00291 , ClfTrain/test_init_loss: 0.00565 , ClfTrain/test_iter_loss: 0.00300
[train: 35, 350 / 1000] FPS: 7.2 (8.5) , Loss/total: 3.48681 , Loss/segm: 1.42463 , Stats/acc: 0.81138 , Stats/clf_acc: 0.81352 , Stats/clf_peak_dist: 1.50249 , Loss/target_clf: 0.29233 , Loss/test_init_clf: 0.56588 , Loss/test_iter_clf: 1.20398 , ClfTrain/test_loss: 0.00292 , ClfTrain/test_init_loss: 0.00566 , ClfTrain/test_iter_loss: 0.00301
[train: 35, 400 / 1000] FPS: 7.3 (8.6) , Loss/total: 3.48453 , Loss/segm: 1.42766 , Stats/acc: 0.81063 , Stats/clf_acc: 0.81439 , Stats/clf_peak_dist: 1.50199 , Loss/target_clf: 0.29131 , Loss/test_init_clf: 0.56518 , Loss/test_iter_clf: 1.20038 , ClfTrain/test_loss: 0.00291 , ClfTrain/test_init_loss: 0.00565 , ClfTrain/test_iter_loss: 0.00300
[train: 35, 450 / 1000] FPS: 7.4 (7.9) , Loss/total: 3.46665 , Loss/segm: 1.41136 , Stats/acc: 0.81178 , Stats/clf_acc: 0.81442 , Stats/clf_peak_dist: 1.49249 , Loss/target_clf: 0.29090 , Loss/test_init_clf: 0.56564 , Loss/test_iter_clf: 1.19875 , ClfTrain/test_loss: 0.00291 , ClfTrain/test_init_loss: 0.00566 , ClfTrain/test_iter_loss: 0.00300
[train: 35, 500 / 1000] FPS: 7.5 (8.6) , Loss/total: 3.47195 , Loss/segm: 1.41802 , Stats/acc: 0.81161 , Stats/clf_acc: 0.81316 , Stats/clf_peak_dist: 1.49277 , Loss/target_clf: 0.29065 , Loss/test_init_clf: 0.56554 , Loss/test_iter_clf: 1.19773 , ClfTrain/test_loss: 0.00291 , ClfTrain/test_init_loss: 0.00566 , ClfTrain/test_iter_loss: 0.00299
[train: 35, 550 / 1000] FPS: 7.6 (8.6) , Loss/total: 3.47042 , Loss/segm: 1.41676 , Stats/acc: 0.81218 , Stats/clf_acc: 0.81317 , Stats/clf_peak_dist: 1.48925 , Loss/target_clf: 0.29068 , Loss/test_init_clf: 0.56515 , Loss/test_iter_clf: 1.19784 , ClfTrain/test_loss: 0.00291 , ClfTrain/test_init_loss: 0.00565 , ClfTrain/test_iter_loss: 0.00299
[train: 35, 600 / 1000] FPS: 7.6 (8.6) , Loss/total: 3.47111 , Loss/segm: 1.41505 , Stats/acc: 0.81288 , Stats/clf_acc: 0.81330 , Stats/clf_peak_dist: 1.48821 , Loss/target_clf: 0.29094 , Loss/test_init_clf: 0.56579 , Loss/test_iter_clf: 1.19933 , ClfTrain/test_loss: 0.00291 , ClfTrain/test_init_loss: 0.00566 , ClfTrain/test_iter_loss: 0.00300
[train: 35, 650 / 1000] FPS: 7.7 (8.1) , Loss/total: 3.46644 , Loss/segm: 1.41017 , Stats/acc: 0.81301 , Stats/clf_acc: 0.81306 , Stats/clf_peak_dist: 1.49166 , Loss/target_clf: 0.29097 , Loss/test_init_clf: 0.56550 , Loss/test_iter_clf: 1.19980 , ClfTrain/test_loss: 0.00291 , ClfTrain/test_init_loss: 0.00566 , ClfTrain/test_iter_loss: 0.00300
[train: 35, 700 / 1000] FPS: 7.7 (7.7) , Loss/total: 3.46132 , Loss/segm: 1.40578 , Stats/acc: 0.81357 , Stats/clf_acc: 0.81410 , Stats/clf_peak_dist: 1.48310 , Loss/target_clf: 0.29057 , Loss/test_init_clf: 0.56627 , Loss/test_iter_clf: 1.19870 , ClfTrain/test_loss: 0.00291 , ClfTrain/test_init_loss: 0.00566 , ClfTrain/test_iter_loss: 0.00300
[train: 35, 750 / 1000] FPS: 7.7 (8.5) , Loss/total: 3.47012 , Loss/segm: 1.41298 , Stats/acc: 0.81291 , Stats/clf_acc: 0.81304 , Stats/clf_peak_dist: 1.48957 , Loss/target_clf: 0.29094 , Loss/test_init_clf: 0.56643 , Loss/test_iter_clf: 1.19976 , ClfTrain/test_loss: 0.00291 , ClfTrain/test_init_loss: 0.00566 , ClfTrain/test_iter_loss: 0.00300
[train: 35, 800 / 1000] FPS: 7.8 (8.4) , Loss/total: 3.46390 , Loss/segm: 1.41175 , Stats/acc: 0.81328 , Stats/clf_acc: 0.81453 , Stats/clf_peak_dist: 1.48630 , Loss/target_clf: 0.29003 , Loss/test_init_clf: 0.56614 , Loss/test_iter_clf: 1.19598 , ClfTrain/test_loss: 0.00290 , ClfTrain/test_init_loss: 0.00566 , ClfTrain/test_iter_loss: 0.00299
[train: 35, 850 / 1000] FPS: 7.8 (7.8) , Loss/total: 3.46100 , Loss/segm: 1.40915 , Stats/acc: 0.81317 , Stats/clf_acc: 0.81399 , Stats/clf_peak_dist: 1.48657 , Loss/target_clf: 0.28999 , Loss/test_init_clf: 0.56622 , Loss/test_iter_clf: 1.19564 , ClfTrain/test_loss: 0.00290 , ClfTrain/test_init_loss: 0.00566 , ClfTrain/test_iter_loss: 0.00299
[train: 35, 900 / 1000] FPS: 7.8 (7.9) , Loss/total: 3.46227 , Loss/segm: 1.41145 , Stats/acc: 0.81321 , Stats/clf_acc: 0.81338 , Stats/clf_peak_dist: 1.48993 , Loss/target_clf: 0.28992 , Loss/test_init_clf: 0.56617 , Loss/test_iter_clf: 1.19472 , ClfTrain/test_loss: 0.00290 , ClfTrain/test_init_loss: 0.00566 , ClfTrain/test_iter_loss: 0.00299
[train: 35, 950 / 1000] FPS: 7.8 (8.9) , Loss/total: 3.46183 , Loss/segm: 1.40909 , Stats/acc: 0.81303 , Stats/clf_acc: 0.81387 , Stats/clf_peak_dist: 1.48878 , Loss/target_clf: 0.29013 , Loss/test_init_clf: 0.56697 , Loss/test_iter_clf: 1.19564 , ClfTrain/test_loss: 0.00290 , ClfTrain/test_init_loss: 0.00567 , ClfTrain/test_iter_loss: 0.00299
[train: 35, 1000 / 1000] FPS: 7.9 (8.2) , Loss/total: 3.46628 , Loss/segm: 1.41284 , Stats/acc: 0.81263 , Stats/clf_acc: 0.81293 , Stats/clf_peak_dist: 1.49386 , Loss/target_clf: 0.29033 , Loss/test_init_clf: 0.56673 , Loss/test_iter_clf: 1.19637 , ClfTrain/test_loss: 0.00290 , ClfTrain/test_init_loss: 0.00567 , ClfTrain/test_iter_loss: 0.00299
Training crashed at epoch 35
Traceback for the error!
Traceback (most recent call last):
File "/home/mytest/myprojects/pytracking-master/ltr/trainers/base_trainer.py", line 70, in train
self.train_epoch()
File "/home/mytest/myprojects/pytracking-master/ltr/trainers/ltr_trainer.py", line 93, in train_epoch
self.cycle_dataset(loader)
File "/home/mytest/myprojects/pytracking-master/ltr/trainers/ltr_trainer.py", line 66, in cycle_dataset
for i, data in enumerate(loader, 1):
File "/home/mytest/anaconda3/envs/pytrack/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 435, in next
data = self._next_data()
File "/home/mytest/anaconda3/envs/pytrack/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1085, in _next_data
return self._process_data(data)
File "/home/mytest/anaconda3/envs/pytrack/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1111, in _process_data
data.reraise()
File "/home/mytest/anaconda3/envs/pytrack/lib/python3.7/site-packages/torch/_utils.py", line 428, in reraise
raise self.exc_type(msg)
KeyError: Caught KeyError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/home/mytest/anaconda3/envs/pytrack/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 198, in _worker_loop
data = fetcher.fetch(index)
File "/home/mytest/anaconda3/envs/pytrack/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/mytest/anaconda3/envs/pytrack/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in
Restarting training from last epoch ... Finished training!
Process finished with exit code 0
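For anyone hitting the same crash: because the KeyError is re-raised from a DataLoader worker, the printed traceback hides which sample and which key actually failed. Below is a minimal sketch (the names `train_sampler` and `probe_sampler` are illustrative, not the exact pytracking/RTS variables) for surfacing the original traceback by sampling in the main process instead of in a worker:

```python
# Hedged sketch: iterate the training sampler directly (no workers, no
# collation), so any KeyError surfaces with its full traceback, including the
# failing dictionary key, rather than being re-raised from a worker process.
import traceback

def probe_sampler(train_sampler, num_samples=2000):
    for i in range(min(num_samples, len(train_sampler))):
        try:
            _ = train_sampler[i]
        except KeyError:
            print(f'KeyError at sample index {i}:')
            traceback.print_exc()
```

Note that the pytracking samplers typically draw random sequences inside `__getitem__`, so the same index may not fail twice in a row; the point of the sketch is only to get the un-wrapped traceback when a failure does occur.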
Hi @mywebinfo65536. This is a bit strange, it looks like missing data that the sampler tries to load. Is it reproducible ? Could you tell me what's the size of self.sequence_list for your got10k ?
Hello mattpfr, the original self.sequence_list size is 7086, and after handling [GOT-10k_Train_004419] the final size is 7085.
By the way, could you tell me whether the training speed (FPS: 6-8) is normal? It felt a bit low to me during training.
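Regarding the sequence_list size mentioned above, here is a small sketch that cross-checks the listed GOT-10k sequences against what is actually present on disk. It assumes the standard GOT-10k layout (a train/list.txt plus one directory per sequence, each with a groundtruth.txt); the path is a placeholder:

```python
# Hedged sketch: compare the GOT-10k train list.txt against the directories on
# disk, which is the usual way a sequence such as GOT-10k_Train_004419 turns
# out to be missing or incomplete.
import os

got10k_train = '/path/to/got10k/train'   # placeholder; point this at your train split

with open(os.path.join(got10k_train, 'list.txt')) as f:
    listed = [line.strip() for line in f if line.strip()]

on_disk = {d for d in os.listdir(got10k_train)
           if os.path.isdir(os.path.join(got10k_train, d))}

missing_dirs = [s for s in listed if s not in on_disk]
no_groundtruth = [s for s in listed if s in on_disk
                  and not os.path.isfile(os.path.join(got10k_train, s, 'groundtruth.txt'))]

print(f'{len(listed)} sequences in list.txt, {len(on_disk)} directories on disk')
print('listed but missing on disk:', missing_dirs)
print('present but without groundtruth.txt:', no_groundtruth)
```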
This seems alright to me. The 004419 sequence is indeed missing, but if it is not in the sequence_list, it should not be a problem. Maybe it would be helpful to log the sequence currently being processed somewhere?
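A rough sketch of that logging idea: wrap the training sampler before handing it to the loader, so a crash at least records where sampling failed. The wrapper only assumes the sampler exposes `__len__`/`__getitem__` and, optionally, a `sequence_list` attribute; printing the exact sequence name would need a print inside the sampler itself, where the sequence id is drawn.

```python
class LoggingSamplerWrapper:
    """Hypothetical helper: re-raise any sampling error after logging the index."""

    def __init__(self, sampler):
        self.sampler = sampler
        # Keep a reference to the sequence list if the wrapped sampler exposes one.
        self.sequence_list = getattr(sampler, 'sequence_list', None)

    def __len__(self):
        return len(self.sampler)

    def __getitem__(self, index):
        try:
            return self.sampler[index]
        except Exception as e:
            size = len(self.sequence_list) if self.sequence_list is not None else 'unknown'
            print(f'Sampling failed at index {index} '
                  f'({type(e).__name__}); sequence_list size: {size}')
            raise
```

With num_workers > 0 these prints come from the worker process, so they may appear interleaved with the progress lines, but they are still written out before the exception is re-raised.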
I guess it depends on the hardware you use. When I trained, I was at around ~10 FPS, so your numbers seem normal.
@mywebinfo65536 do you have any further issues ? Or is it working for you now ?
Hi mattpfr, thanks for your reply. I have no further issues for now, but the error itself is still not fixed on my side; I will look into it when I have free time.
@mywebinfo65536 thanks, yes, please do let me know, so that I know whether there is anything left to do for this ticket.
I also encountered this error, in the 5th epoch of training. How can it be solved?