
Error when reaching the 90th training epoch

Livier18 opened this issue 8 months ago • 2 comments

I use 'CUDA_VISIBLE_DEVICES=0 torchrun --master_port=7777 --nproc_per_node=1 train.py -c configs/deim_dfine/deim_hgnetv2_n_GC10.yml --use-amp --seed=0' to train the model. When training reached the 90th epoch, the following error occurred:

[screenshots of the error traceback]

How can I solve this problem? Looking forward to your reply!

Best wishes!

Livier18 avatar May 27 '25 02:05 Livier18

Hi, I faced the exact same issue where training would crash with a NaN tensor and an AssertionError after many epochs. This seems to happen specifically when using Automatic Mixed Precision (--use-amp).

The root cause is likely the default eps=1e-8 value in the AdamW optimizer, which is too small for float16 precision. This can lead to a division-by-zero error in the optimizer step, resulting in NaN values.
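To see why this matters (a minimal NumPy sketch, not the DEIM or PyTorch optimizer code itself): the smallest positive float16 value is about 6e-8, so an eps of 1e-8 underflows to exactly zero in half precision, while 1e-7 remains representable. The `v_hat` and `grad` values below are hypothetical stand-ins for the second-moment estimate and gradient in an AdamW-style update.

```python
import numpy as np

# eps=1e-8 is below the smallest positive float16 (~6e-8), so it
# rounds to exactly 0.0; eps=1e-7 survives as a subnormal value.
eps_small = np.float16(1e-8)   # -> 0.0
eps_large = np.float16(1e-7)   # -> ~1.19e-07, still positive

# Hypothetical AdamW-style denominator: sqrt(v_hat) + eps.
# If the second-moment estimate v_hat also underflows to zero,
# eps=1e-8 leaves the denominator at exactly 0 -> division blows up.
v_hat = np.float16(0.0)
grad = np.float16(1e-3)

step_bad = grad / (np.sqrt(v_hat) + eps_small)  # non-finite (inf)
step_ok = grad / (np.sqrt(v_hat) + eps_large)   # finite update

print(eps_small, eps_large)
print(np.isfinite(step_bad), np.isfinite(step_ok))
```

Once a single non-finite value like this enters the parameters, it propagates through the forward pass and eventually trips the NaN assertion seen in the traceback.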

I was able to fix this and complete the training successfully by setting a slightly larger epsilon (1e-7) in the optimizer configuration:

# In your .yml config file
optimizer:
  type: AdamW
  # ... other params
  eps: 1.0e-7

This change stabilizes the training with AMP enabled. This is a known interaction, and you can find more technical details in this PyTorch issue: https://github.com/pytorch/pytorch/issues/26218.
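For anyone constructing the optimizer in code rather than through the YAML config, the equivalent change is just passing a larger `eps` to AdamW directly (a sketch with a placeholder model; DEIM's actual optimizer-building code may differ):

```python
import torch

# Placeholder model standing in for the real DEIM network.
model = torch.nn.Linear(4, 2)

# The only change relative to the defaults: eps raised from 1e-8 to
# 1e-7 so the update denominator stays representable under AMP.
opt = torch.optim.AdamW(model.parameters(), lr=1e-4, eps=1e-7)

print(opt.defaults["eps"])
```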

Hope this helps you solve the problem!

EwertzJN avatar Jul 31 '25 09:07 EwertzJN

@EwertzJN Nice! We have trained hundreds of DEIM models and occasionally encountered similar NaN issues, which are not consistently reproducible and occur randomly, likely due to instability in AMP. We hope @EwertzJN's suggestion resolves the problem for you; alternatively, you can try turning off AMP.

ShihuaHuang95 avatar Nov 01 '25 01:11 ShihuaHuang95