Unable to reproduce result reported in paper
I trained the DEIM-D-FINE-X model on 8 A100 GPUs (40 GB each), and at epoch 50 I see an AP [0.5:0.95] of 55.8 mAP, while the paper reports 56.5 mAP (Table 1). What could cause this discrepancy? Has anyone else noticed this?
Also, training stopped at 50 epochs with the following error:
```
[rank2]: Traceback (most recent call last):
[rank2]:   File "<path>/DEIM/train.py", line 84, in <module>
[rank2]:     main(args)
[rank2]:   File "<path>/DEIM/train.py", line 54, in main
[rank2]:     solver.fit()
[rank2]:   File "<path>/DEIM/engine/solver/det_solver.py", line 72, in fit
[rank2]:     self.load_resume_state(str(self.output_dir / 'best_stg1.pth'))
[rank2]:   File "<path>/DEIM/engine/solver/_solver.py", line 159, in load_resume_state
[rank2]:     state = torch.load(path, map_location='cpu')
[rank2]:             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/opt/conda/envs/deim_p312/lib/python3.12/site-packages/torch/serialization.py", line 1384, in load
[rank2]:     return _legacy_load(
[rank2]:            ^^^^^^^^^^^^^
[rank2]:   File "/opt/conda/envs/deim_p312/lib/python3.12/site-packages/torch/serialization.py", line 1628, in _legacy_load
[rank2]:     magic_number = pickle_module.load(f, **pickle_load_args)
[rank2]:                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: EOFError: Ran out of input

[rank5]: Traceback (most recent call last):
[rank5]:   File "<path>/DEIM/train.py", line 84, in <module>
[rank5]:     main(args)
[rank5]:   File "<path>/DEIM/train.py", line 54, in main
[rank5]:     solver.fit()
[rank5]:   File "<path>/DEIM/engine/solver/det_solver.py", line 72, in fit
[rank5]:     self.load_resume_state(str(self.output_dir / 'best_stg1.pth'))
[rank5]:   File "<path>/DEIM/engine/solver/_solver.py", line 159, in load_resume_state
[rank5]:     state = torch.load(path, map_location='cpu')
[rank5]:             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank5]:   File "/opt/conda/envs/deim_p312/lib/python3.12/site-packages/torch/serialization.py", line 1360, in load
[rank5]:     return _load(
[rank5]:            ^^^^^^
[rank5]:   File "/opt/conda/envs/deim_p312/lib/python3.12/site-packages/torch/serialization.py", line 1848, in _load
[rank5]:     result = unpickler.load()
[rank5]:              ^^^^^^^^^^^^^^^^
[rank5]:   File "/opt/conda/envs/deim_p312/lib/python3.12/site-packages/torch/serialization.py", line 1812, in persistent_load
[rank5]:     typed_storage = load_tensor(
[rank5]:                     ^^^^^^^^^^^^
[rank5]:   File "/opt/conda/envs/deim_p312/lib/python3.12/site-packages/torch/serialization.py", line 1772, in load_tensor
[rank5]:     zip_file.get_storage_from_record(name, numel, torch.UntypedStorage)
[rank5]: RuntimeError: PytorchStreamReader failed reading file data/37: file read failed
```
I suspect this is because of the following config in configs/base/deim.yml:
```yaml
collate_fn:
  mixup_prob: 0.5
  mixup_epochs: [4, 29]
  stop_epoch: 50    # epoch in [72, ~) stop `multiscales`
```
But I'm not sure how to fix this yet. Any suggestions?
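For what it's worth, the `EOFError: Ran out of input` and `PytorchStreamReader failed reading file` errors look like symptoms of a rank reading `best_stg1.pth` while it was still being written (or after a partial write), rather than anything in the mixup/`stop_epoch` config. A general mitigation, sketched below with stdlib `pickle` standing in for `torch.save`/`torch.load` (the `save_checkpoint`/`load_checkpoint` names are hypothetical, not DEIM's API), is to write the checkpoint to a temporary file and `os.replace` it into place, so readers only ever see a complete file:

```python
import os
import pickle
import tempfile

def save_checkpoint(state, path):
    """Atomic save: readers see either the old file or the complete new one."""
    dir_name = os.path.dirname(path) or "."
    # The temp file must live in the same directory for os.replace to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=dir_name, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(state, f)      # torch.save(state, f) in the real codebase
            f.flush()
            os.fsync(f.fileno())       # make sure the bytes reach the disk
        os.replace(tmp_path, path)     # atomic rename on POSIX and Windows
    except BaseException:
        os.unlink(tmp_path)            # never leave a half-written temp file
        raise

def load_checkpoint(path):
    with open(path, "rb") as f:
        return pickle.load(f)          # torch.load(f, map_location='cpu') in the real codebase
```

In a multi-GPU run you would additionally have only rank 0 call the save, and put a `torch.distributed.barrier()` between the save and any `load_resume_state` call, so that no rank starts reading before rank 0 has finished writing.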
Did you train on Linux? I encountered the same problem when training DEIM (deim_hgnetv2_s_coco.yml) on a custom dataset on Windows 11. I worked around it by adding 1 to the number of dataset classes (in configs/dataset/custom_detection.yml). As for the lower mAP, I think the authors trained multiple times and selected the run with the best mAP.
Hi @csampat-a Take a look here https://github.com/Intellindust-AI-Lab/DEIM/blob/bc11dfefc08d79756508c7f8b56c29feb909a4f0/configs/deim_dfine/deim_hgnetv2_x_coco.yml#L22-L37 Technically, to reproduce the exact same results as in the paper, you should first train for 50 epochs with augmentation, and then for an additional 8 epochs for the optimal EMA decay search.
@xianbeisukisu I am seeing the above error at the 50th epoch. I guess that is when the training run tries to load the best_stg1.pth checkpoint. I was able to resume training from the 50th epoch by restarting training with `--resume`.
@SebastianJanampa Table 1 of the paper states that the performance is achieved at 50 epochs. After training for 58 epochs, I was able to reach 56.16 (using 8 A100 GPUs). I wonder if the numbers reported in the paper were obtained with a different training setup. I probably need to adjust the LR for my batch size.
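On adjusting the LR: if the total batch size differs from the reference config, a common heuristic (an assumption on my part, not something the DEIM repo necessarily prescribes) is the linear scaling rule, keeping `lr / batch_size` constant:

```python
def scaled_lr(base_lr, base_batch_size, batch_size):
    """Linear scaling rule: the learning rate grows in proportion to batch size."""
    return base_lr * batch_size / base_batch_size

# Hypothetical numbers for illustration: a base LR of 2.5e-4 at total batch 32,
# moved to total batch 64, would become 5e-4.
```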
@csampat-a ,
Did you use the configurations provided by this repo? Do not worry about the slight difference (0.3 points); that's normal. If you trained it again, you would get a score that differs from 56.16 by a small margin. If your goal is to submit a paper, you should use the officially reported results from the DEIM paper.
Thank you very much for your interest in and attention to our work. You should strictly follow the official configuration — for example, DEIM-D-FINE-X should be set to 50 + 8 epochs. The extra 8 epochs are introduced in D-FINE for performing a better decay search. Meanwhile, it’s normal to observe small fluctuations within ±0.1 AP.
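For readers unfamiliar with the decay search mentioned above: an EMA of the model weights maintains a running average `theta_ema = d * theta_ema + (1 - d) * theta`, and the extra epochs let the training sweep the decay `d` for the best final weights. A minimal scalar sketch of the update (a generic illustration, not D-FINE's actual implementation):

```python
def ema_update(ema_value, value, decay):
    """One EMA step: blend the running average toward the current weight."""
    return decay * ema_value + (1.0 - decay) * value

# A weight that jumps to 1.0 and stays there: the EMA closes the gap
# geometrically, reaching 1 - decay**n after n steps.
ema = 0.0
for _ in range(3):
    ema = ema_update(ema, 1.0, 0.9)
# ema is now 1 - 0.9**3 = 0.271
```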