
RTS training crashed with an error at the 35th epoch

Open mywebinfo65536 opened this issue 3 years ago • 9 comments

Hi, I trained the RTS code on GOT-10k with only one A100 GPU. The first 34 epochs worked fine and produced checkpoint files, but an error occurred while the 35th epoch was running. The error details are below:

Restarting training from last epoch ...
[train: 35, 50 / 1000] FPS: 4.0 (7.9) , Loss/total: 3.41968 , Loss/segm: 1.38611 , Stats/acc: 0.81421 , Stats/clf_acc: 0.82711 , Stats/clf_peak_dist: 1.40691 , Loss/target_clf: 0.28692 , Loss/test_init_clf: 0.56498 , Loss/test_iter_clf: 1.18168 , ClfTrain/test_loss: 0.00287 , ClfTrain/test_init_loss: 0.00565 , ClfTrain/test_iter_loss: 0.00295
[train: 35, 100 / 1000] FPS: 5.4 (8.5) , Loss/total: 3.44221 , Loss/segm: 1.39679 , Stats/acc: 0.81277 , Stats/clf_acc: 0.82222 , Stats/clf_peak_dist: 1.45234 , Loss/target_clf: 0.28915 , Loss/test_init_clf: 0.56807 , Loss/test_iter_clf: 1.18819 , ClfTrain/test_loss: 0.00289 , ClfTrain/test_init_loss: 0.00568 , ClfTrain/test_iter_loss: 0.00297
[train: 35, 150 / 1000] FPS: 6.1 (8.0) , Loss/total: 3.43003 , Loss/segm: 1.38178 , Stats/acc: 0.81604 , Stats/clf_acc: 0.82044 , Stats/clf_peak_dist: 1.46284 , Loss/target_clf: 0.28963 , Loss/test_init_clf: 0.56793 , Loss/test_iter_clf: 1.19069 , ClfTrain/test_loss: 0.00290 , ClfTrain/test_init_loss: 0.00568 , ClfTrain/test_iter_loss: 0.00298
[train: 35, 200 / 1000] FPS: 6.5 (8.4) , Loss/total: 3.46724 , Loss/segm: 1.40390 , Stats/acc: 0.81261 , Stats/clf_acc: 0.81378 , Stats/clf_peak_dist: 1.49966 , Loss/target_clf: 0.29318 , Loss/test_init_clf: 0.56535 , Loss/test_iter_clf: 1.20482 , ClfTrain/test_loss: 0.00293 , ClfTrain/test_init_loss: 0.00565 , ClfTrain/test_iter_loss: 0.00301
[train: 35, 250 / 1000] FPS: 6.8 (8.3) , Loss/total: 3.48664 , Loss/segm: 1.41969 , Stats/acc: 0.81263 , Stats/clf_acc: 0.81333 , Stats/clf_peak_dist: 1.50668 , Loss/target_clf: 0.29351 , Loss/test_init_clf: 0.56558 , Loss/test_iter_clf: 1.20786 , ClfTrain/test_loss: 0.00294 , ClfTrain/test_init_loss: 0.00566 , ClfTrain/test_iter_loss: 0.00302
[train: 35, 300 / 1000] FPS: 7.0 (8.9) , Loss/total: 3.46265 , Loss/segm: 1.40664 , Stats/acc: 0.81405 , Stats/clf_acc: 0.81570 , Stats/clf_peak_dist: 1.48712 , Loss/target_clf: 0.29100 , Loss/test_init_clf: 0.56539 , Loss/test_iter_clf: 1.19962 , ClfTrain/test_loss: 0.00291 , ClfTrain/test_init_loss: 0.00565 , ClfTrain/test_iter_loss: 0.00300
[train: 35, 350 / 1000] FPS: 7.2 (8.5) , Loss/total: 3.48681 , Loss/segm: 1.42463 , Stats/acc: 0.81138 , Stats/clf_acc: 0.81352 , Stats/clf_peak_dist: 1.50249 , Loss/target_clf: 0.29233 , Loss/test_init_clf: 0.56588 , Loss/test_iter_clf: 1.20398 , ClfTrain/test_loss: 0.00292 , ClfTrain/test_init_loss: 0.00566 , ClfTrain/test_iter_loss: 0.00301
[train: 35, 400 / 1000] FPS: 7.3 (8.6) , Loss/total: 3.48453 , Loss/segm: 1.42766 , Stats/acc: 0.81063 , Stats/clf_acc: 0.81439 , Stats/clf_peak_dist: 1.50199 , Loss/target_clf: 0.29131 , Loss/test_init_clf: 0.56518 , Loss/test_iter_clf: 1.20038 , ClfTrain/test_loss: 0.00291 , ClfTrain/test_init_loss: 0.00565 , ClfTrain/test_iter_loss: 0.00300
[train: 35, 450 / 1000] FPS: 7.4 (7.9) , Loss/total: 3.46665 , Loss/segm: 1.41136 , Stats/acc: 0.81178 , Stats/clf_acc: 0.81442 , Stats/clf_peak_dist: 1.49249 , Loss/target_clf: 0.29090 , Loss/test_init_clf: 0.56564 , Loss/test_iter_clf: 1.19875 , ClfTrain/test_loss: 0.00291 , ClfTrain/test_init_loss: 0.00566 , ClfTrain/test_iter_loss: 0.00300
[train: 35, 500 / 1000] FPS: 7.5 (8.6) , Loss/total: 3.47195 , Loss/segm: 1.41802 , Stats/acc: 0.81161 , Stats/clf_acc: 0.81316 , Stats/clf_peak_dist: 1.49277 , Loss/target_clf: 0.29065 , Loss/test_init_clf: 0.56554 , Loss/test_iter_clf: 1.19773 , ClfTrain/test_loss: 0.00291 , ClfTrain/test_init_loss: 0.00566 , ClfTrain/test_iter_loss: 0.00299
[train: 35, 550 / 1000] FPS: 7.6 (8.6) , Loss/total: 3.47042 , Loss/segm: 1.41676 , Stats/acc: 0.81218 , Stats/clf_acc: 0.81317 , Stats/clf_peak_dist: 1.48925 , Loss/target_clf: 0.29068 , Loss/test_init_clf: 0.56515 , Loss/test_iter_clf: 1.19784 , ClfTrain/test_loss: 0.00291 , ClfTrain/test_init_loss: 0.00565 , ClfTrain/test_iter_loss: 0.00299
[train: 35, 600 / 1000] FPS: 7.6 (8.6) , Loss/total: 3.47111 , Loss/segm: 1.41505 , Stats/acc: 0.81288 , Stats/clf_acc: 0.81330 , Stats/clf_peak_dist: 1.48821 , Loss/target_clf: 0.29094 , Loss/test_init_clf: 0.56579 , Loss/test_iter_clf: 1.19933 , ClfTrain/test_loss: 0.00291 , ClfTrain/test_init_loss: 0.00566 , ClfTrain/test_iter_loss: 0.00300
[train: 35, 650 / 1000] FPS: 7.7 (8.1) , Loss/total: 3.46644 , Loss/segm: 1.41017 , Stats/acc: 0.81301 , Stats/clf_acc: 0.81306 , Stats/clf_peak_dist: 1.49166 , Loss/target_clf: 0.29097 , Loss/test_init_clf: 0.56550 , Loss/test_iter_clf: 1.19980 , ClfTrain/test_loss: 0.00291 , ClfTrain/test_init_loss: 0.00566 , ClfTrain/test_iter_loss: 0.00300
[train: 35, 700 / 1000] FPS: 7.7 (7.7) , Loss/total: 3.46132 , Loss/segm: 1.40578 , Stats/acc: 0.81357 , Stats/clf_acc: 0.81410 , Stats/clf_peak_dist: 1.48310 , Loss/target_clf: 0.29057 , Loss/test_init_clf: 0.56627 , Loss/test_iter_clf: 1.19870 , ClfTrain/test_loss: 0.00291 , ClfTrain/test_init_loss: 0.00566 , ClfTrain/test_iter_loss: 0.00300
[train: 35, 750 / 1000] FPS: 7.7 (8.5) , Loss/total: 3.47012 , Loss/segm: 1.41298 , Stats/acc: 0.81291 , Stats/clf_acc: 0.81304 , Stats/clf_peak_dist: 1.48957 , Loss/target_clf: 0.29094 , Loss/test_init_clf: 0.56643 , Loss/test_iter_clf: 1.19976 , ClfTrain/test_loss: 0.00291 , ClfTrain/test_init_loss: 0.00566 , ClfTrain/test_iter_loss: 0.00300
[train: 35, 800 / 1000] FPS: 7.8 (8.4) , Loss/total: 3.46390 , Loss/segm: 1.41175 , Stats/acc: 0.81328 , Stats/clf_acc: 0.81453 , Stats/clf_peak_dist: 1.48630 , Loss/target_clf: 0.29003 , Loss/test_init_clf: 0.56614 , Loss/test_iter_clf: 1.19598 , ClfTrain/test_loss: 0.00290 , ClfTrain/test_init_loss: 0.00566 , ClfTrain/test_iter_loss: 0.00299
[train: 35, 850 / 1000] FPS: 7.8 (7.8) , Loss/total: 3.46100 , Loss/segm: 1.40915 , Stats/acc: 0.81317 , Stats/clf_acc: 0.81399 , Stats/clf_peak_dist: 1.48657 , Loss/target_clf: 0.28999 , Loss/test_init_clf: 0.56622 , Loss/test_iter_clf: 1.19564 , ClfTrain/test_loss: 0.00290 , ClfTrain/test_init_loss: 0.00566 , ClfTrain/test_iter_loss: 0.00299
[train: 35, 900 / 1000] FPS: 7.8 (7.9) , Loss/total: 3.46227 , Loss/segm: 1.41145 , Stats/acc: 0.81321 , Stats/clf_acc: 0.81338 , Stats/clf_peak_dist: 1.48993 , Loss/target_clf: 0.28992 , Loss/test_init_clf: 0.56617 , Loss/test_iter_clf: 1.19472 , ClfTrain/test_loss: 0.00290 , ClfTrain/test_init_loss: 0.00566 , ClfTrain/test_iter_loss: 0.00299
[train: 35, 950 / 1000] FPS: 7.8 (8.9) , Loss/total: 3.46183 , Loss/segm: 1.40909 , Stats/acc: 0.81303 , Stats/clf_acc: 0.81387 , Stats/clf_peak_dist: 1.48878 , Loss/target_clf: 0.29013 , Loss/test_init_clf: 0.56697 , Loss/test_iter_clf: 1.19564 , ClfTrain/test_loss: 0.00290 , ClfTrain/test_init_loss: 0.00567 , ClfTrain/test_iter_loss: 0.00299
[train: 35, 1000 / 1000] FPS: 7.9 (8.2) , Loss/total: 3.46628 , Loss/segm: 1.41284 , Stats/acc: 0.81263 , Stats/clf_acc: 0.81293 , Stats/clf_peak_dist: 1.49386 , Loss/target_clf: 0.29033 , Loss/test_init_clf: 0.56673 , Loss/test_iter_clf: 1.19637 , ClfTrain/test_loss: 0.00290 , ClfTrain/test_init_loss: 0.00567 , ClfTrain/test_iter_loss: 0.00299
Training crashed at epoch 35
Traceback for the error!
Traceback (most recent call last):
  File "/home/mytest/myprojects/pytracking-master/ltr/trainers/base_trainer.py", line 70, in train
    self.train_epoch()
  File "/home/mytest/myprojects/pytracking-master/ltr/trainers/ltr_trainer.py", line 93, in train_epoch
    self.cycle_dataset(loader)
  File "/home/mytest/myprojects/pytracking-master/ltr/trainers/ltr_trainer.py", line 66, in cycle_dataset
    for i, data in enumerate(loader, 1):
  File "/home/mytest/anaconda3/envs/pytrack/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 435, in __next__
    data = self._next_data()
  File "/home/mytest/anaconda3/envs/pytrack/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1085, in _next_data
    return self._process_data(data)
  File "/home/mytest/anaconda3/envs/pytrack/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1111, in _process_data
    data.reraise()
  File "/home/mytest/anaconda3/envs/pytrack/lib/python3.7/site-packages/torch/_utils.py", line 428, in reraise
    raise self.exc_type(msg)
KeyError: Caught KeyError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/mytest/anaconda3/envs/pytrack/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 198, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/mytest/anaconda3/envs/pytrack/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/mytest/anaconda3/envs/pytrack/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/mytest/myprojects/pytracking-master/ltr/data/sampler.py", line 377, in __getitem__
    return self.processing(data)
  File "/home/mytest/myprojects/pytracking-master/ltr/data/processing.py", line 1712, in __call__
    image=data['train_images'], bbox=data['train_anno'], mask=data['train_masks'])
  File "/home/mytest/myprojects/pytracking-master/ltr/data/transforms.py", line 63, in __call__
    return tuple(out[v] for v in var_names)
  File "/home/mytest/myprojects/pytracking-master/ltr/data/transforms.py", line 63, in <genexpr>
    return tuple(out[v] for v in var_names)
KeyError: 'mask'

Restarting training from last epoch ... Finished training!

Process finished with exit code 0

mywebinfo65536 · Oct 13 '22 08:10
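For context on the traceback above: the final frames show that the transform pipeline gathers every requested variable from a dict of processed sample data, so if the sampler produced a sample without segmentation masks, the 'mask' entry is simply absent and the gather step raises. A minimal sketch of that failure mechanism (the function and variable names here are illustrative, not pytracking's exact code):

# Illustrative sketch of the failure mode seen in transforms.py above:
# the gather step assumes every requested variable exists in the sample dict.
def gather_outputs(out, var_names):
    return tuple(out[v] for v in var_names)

sample = {'image': 'image_tensor', 'bbox': 'bbox_tensor'}   # no 'mask' entry produced for this sample
try:
    gather_outputs(sample, var_names=['image', 'bbox', 'mask'])
except KeyError as err:
    print('KeyError:', err)   # -> KeyError: 'mask', as in the worker traceback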

Hi @mywebinfo65536. This is a bit strange; it looks like the sampler is trying to load data that is missing. Is it reproducible? Could you tell me the size of self.sequence_list for your got10k dataset?

mattpfr · Oct 13 '22 21:10

Hi @mywebinfo65536. This is a bit strange; it looks like the sampler is trying to load data that is missing. Is it reproducible? Could you tell me the size of self.sequence_list for your got10k dataset?

Hello mattpfr, the original self.sequence_list size is 7086, and after removing [GOT-10k_Train_004419] the final size is 7085.

mywebinfo65536 · Oct 14 '22 03:10
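As a side note, that count can also be sanity-checked directly against the files on disk, independently of the dataset class. This is a rough sketch assuming the standard GOT-10k layout, where the train split directory contains a list.txt with one sequence name per line; the root path is a placeholder, and note that pytracking may additionally subset this list via its own split files:

# Rough sanity check of the GOT-10k train split on disk (standard GOT-10k layout assumed).
# The root path below is a placeholder for the local dataset location.
from pathlib import Path

got10k_train = Path('/path/to/got10k/train')
names = [line.strip() for line in (got10k_train / 'list.txt').read_text().splitlines() if line.strip()]
missing = [name for name in names if not (got10k_train / name).is_dir()]
print(f'{len(names)} sequences listed, {len(missing)} missing on disk: {missing}')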

By the way, could you tell me whether the training speed (FPS: 6-8) is normal? It felt a bit low during my training.

mywebinfo65536 · Oct 14 '22 03:10

Hi @mywebinfo65536. This is a bit strange; it looks like the sampler is trying to load data that is missing. Is it reproducible? Could you tell me the size of self.sequence_list for your got10k dataset?

Hello mattpfr, the original self.sequence_list size is 7086, and after removing [GOT-10k_Train_004419] the final size is 7085.

This seems alright to me. The 004419 sequence is indeed missing, but if it is not in the sequence_list, it should not be a problem. Maybe it would be helpful to log the sequence currently being processed somewhere?

mattpfr · Oct 15 '22 17:10
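One generic way to get that kind of logging without modifying the sampler is to wrap the training dataset so the failing index and exception are printed inside the worker before the error propagates. This is only a sketch (LoggingDataset is not part of pytracking), and since pytracking's sampler draws a random sequence per call, printing the chosen sequence name inside sampler.py itself would be even more informative:

# Hypothetical debugging wrapper (not part of pytracking): logs which sample index
# failed inside a DataLoader worker, then re-raises so the original error is preserved.
from torch.utils.data import Dataset

class LoggingDataset(Dataset):
    def __init__(self, wrapped):
        self.wrapped = wrapped

    def __len__(self):
        return len(self.wrapped)

    def __getitem__(self, index):
        try:
            return self.wrapped[index]
        except Exception as err:
            print(f'sample {index} failed with {type(err).__name__}: {err}')
            raise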

By the way, could you tell me whether the training speed (FPS: 6-8) is normal? It felt a bit low during my training.

I guess it depends on the hardware you use. When I trained, I was at around ~10 FPS, so your numbers seem normal.

mattpfr · Oct 15 '22 17:10
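If the FPS needs investigating further, one quick check is whether the data pipeline alone keeps up with the reported training speed. A rough, generic sketch (loader stands for the already-constructed training data loader; multiply the result by the batch size to compare with the FPS the trainer prints):

# Rough timing of the data pipeline alone, to see whether loading is the bottleneck.
# 'loader' is assumed to be the existing training data loader.
import time

def batches_per_second(loader, n_batches=50):
    it = iter(loader)
    next(it)                      # warm-up: the first batch pays worker start-up cost
    start = time.time()
    for _ in range(n_batches):
        next(it)
    return n_batches / (time.time() - start)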

@mywebinfo65536 do you have any further issues? Or is it working for you now?

mattpfr · Feb 13 '23 13:02

@mywebinfo65536 do you have any further issues? Or is it working for you now?

Hi mattpfr, thanks for your reply. I have no further issues for now, but the crash is still not fixed for me; I will look into it later when I have free time.

mywebinfo65536 · Feb 14 '23 02:02

@mywebinfo65536 thanks, yes, please do let me know, so that I know whether there is anything left to do for this ticket.

mattpfr · Feb 15 '23 09:02

I also encountered this error, in the 5th epoch of training. How can I solve it?

DAVIE-LAU · Mar 15 '23 07:03