Error when reaching the 90th training epoch
I use 'CUDA_VISIBLE_DEVICES=0 torchrun --master_port=7777 --nproc_per_node=1 train.py -c configs/deim_dfine/deim_hgnetv2_n_GC10.yml --use-amp --seed=0' to train the model. When training reached the 90th epoch, the following error occurred:
How can I solve this problem? Looking forward to your reply!
Best wishes!
Hi, I faced the exact same issue where training would crash with a NaN tensor and an AssertionError after many epochs. This seems to happen specifically when using Automatic Mixed Precision (--use-amp).
The root cause is likely the default eps=1e-8 in the AdamW optimizer, which is too small for float16: values below roughly 6e-8 cannot be represented in half precision and round to zero, so the eps term in the denominator of the update effectively vanishes. The optimizer step can then divide by (nearly) zero, producing inf/NaN values that propagate through training.
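For what it's worth, you can check the underflow claim directly in plain PyTorch (a standalone snippet, independent of the training code):

# 1e-8 is below the smallest float16 subnormal (~6e-8) and rounds to zero,
# while 1e-7 survives as a nonzero subnormal (~1.19e-07)
import torch
print(torch.tensor(1e-8, dtype=torch.float16))  # tensor(0., dtype=torch.float16)
print(torch.tensor(1e-7, dtype=torch.float16))  # nonzero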
I was able to fix this and complete the training successfully by setting a slightly larger epsilon (1e-7) in the optimizer configuration:
# In your .yml config file
optimizer:
  type: AdamW
  # ... other params
  eps: 0.0000001
This change stabilizes the training with AMP enabled. This is a known interaction, and you can find more technical details in this PyTorch issue: https://github.com/pytorch/pytorch/issues/26218.
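For reference, here is a minimal plain-PyTorch sketch of the same setting outside the yml file, assuming the config value is simply forwarded to torch.optim.AdamW (toy model, requires a CUDA device; not the actual DEIM training loop):

import torch
import torch.nn as nn

model = nn.Linear(16, 16).cuda()
# eps=1e-7 instead of the default 1e-8, matching the config change above
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, eps=1e-7)
scaler = torch.cuda.amp.GradScaler()

x = torch.randn(4, 16, device="cuda")
with torch.cuda.amp.autocast():
    loss = model(x).square().mean()

scaler.scale(loss).backward()  # backward pass on the scaled loss
scaler.step(optimizer)         # unscales grads and skips the step if they contain inf/NaN
scaler.update()
optimizer.zero_grad()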
Hope this helps you solve the problem!
@EwertzJN Nice! We have trained hundreds of DEIM models and have occasionally run into similar NaN issues; they are not consistently reproducible and occur at random, which points to instability in AMP. We hope you have resolved the problem with @EwertzJN's suggestion, or you can try turning off AMP.
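Turning AMP off is simply your original command without the --use-amp flag:

CUDA_VISIBLE_DEVICES=0 torchrun --master_port=7777 --nproc_per_node=1 train.py -c configs/deim_dfine/deim_hgnetv2_n_GC10.yml --seed=0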