
AssertionError assert (boxes1[:, 2:] >= boxes1[:, :2]).all()

Open Sunburst7 opened this issue 9 months ago • 12 comments

Hello, I randomly hit a bug when training on a custom dataset. Can you give me some ideas on how to fix it? 😭😭 Here is the log:

Epoch: [48]  [  0/112]  eta: 0:05:41  lr: 0.000013  loss: 66.9744 (66.9744)  loss_bbox: 0.1108 (0.1108)  loss_bbox_aux_0: 0.1473 (0.1473)  loss_bbox_aux_1: 0.1123 (0.1123)  loss_bbox_aux_2: 0.1125 (0.1125)  loss_bbox_aux_3: 0.1110 (0.1110)  loss_bbox_aux_4: 0.1108 (0.1108)  loss_bbox_dn_0: 0.1200 (0.1200)  loss_bbox_dn_1: 0.0846 (0.0846)  loss_bbox_dn_2: 0.0808 (0.0808)  loss_bbox_dn_3: 0.0801 (0.0801)  loss_bbox_dn_4: 0.0801 (0.0801)  loss_bbox_dn_5: 0.0801 (0.0801)  loss_bbox_dn_pre: 0.1200 (0.1200)  loss_bbox_enc_0: 0.1741 (0.1741)  loss_bbox_pre: 0.1453 (0.1453)  loss_ddf_aux_0: 1.2560 (1.2560)  loss_ddf_aux_1: 0.1161 (0.1161)  loss_ddf_aux_2: 0.0161 (0.0161)  loss_ddf_aux_3: 0.0022 (0.0022)  loss_ddf_aux_4: -0.0001 (-0.0001)  loss_ddf_dn_0: 0.8419 (0.8419)  loss_ddf_dn_1: 0.0872 (0.0872)  loss_ddf_dn_2: 0.0042 (0.0042)  loss_ddf_dn_3: 0.0004 (0.0004)  loss_ddf_dn_4: 0.0001 (0.0001)  loss_fgl: 1.4658 (1.4658)  loss_fgl_aux_0: 1.3853 (1.3853)  loss_fgl_aux_1: 1.3759 (1.3759)  loss_fgl_aux_2: 1.4444 (1.4444)  loss_fgl_aux_3: 1.4465 (1.4465)  loss_fgl_aux_4: 1.4628 (1.4628)  loss_fgl_dn_0: 1.3113 (1.3113)  loss_fgl_dn_1: 1.2754 (1.2754)  loss_fgl_dn_2: 1.2629 (1.2629)  loss_fgl_dn_3: 1.2614 (1.2614)  loss_fgl_dn_4: 1.2619 (1.2619)  loss_fgl_dn_5: 1.2620 (1.2620)  loss_giou: 0.4707 (0.4707)  loss_giou_aux_0: 0.5837 (0.5837)  loss_giou_aux_1: 0.4715 (0.4715)  loss_giou_aux_2: 0.4781 (0.4781)  loss_giou_aux_3: 0.4721 (0.4721)  loss_giou_aux_4: 0.4707 (0.4707)  loss_giou_dn_0: 0.6187 (0.6187)  loss_giou_dn_1: 0.4390 (0.4390)  loss_giou_dn_2: 0.4201 (0.4201)  loss_giou_dn_3: 0.4168 (0.4168)  loss_giou_dn_4: 0.4165 (0.4165)  loss_giou_dn_5: 0.4165 (0.4165)  loss_giou_dn_pre: 0.6132 (0.6132)  loss_giou_enc_0: 0.6817 (0.6817)  loss_giou_pre: 0.5724 (0.5724)  loss_mal: 3.2324 (3.2324)  loss_mal_aux_0: 2.3652 (2.3652)  loss_mal_aux_1: 3.9062 (3.9062)  loss_mal_aux_2: 5.6055 (5.6055)  loss_mal_aux_3: 3.3750 (3.3750)  loss_mal_aux_4: 3.2148 (3.2148)  loss_mal_dn_0: 0.8051 (0.8051)  loss_mal_dn_1: 0.8197 (0.8197)  loss_mal_dn_2: 0.7374 (0.7374)  loss_mal_dn_3: 0.7656 (0.7656)  loss_mal_dn_4: 0.7122 (0.7122)  loss_mal_dn_5: 0.7305 (0.7305)  loss_mal_dn_pre: 0.7861 (0.7861)  loss_mal_enc_0: 5.3057 (5.3057)  loss_mal_pre: 3.3848 (3.3848)  loss_obj_ll: 0.4802 (0.4802)  loss_obj_ll_aux_0: 0.6456 (0.6456)  loss_obj_ll_aux_1: 0.5525 (0.5525)  loss_obj_ll_aux_2: 0.6497 (0.6497)  loss_obj_ll_aux_3: 0.6068 (0.6068)  loss_obj_ll_aux_4: 0.5420 (0.5420)  time: 3.0448  data: 0.8962  max mem: 6982
tensor([[[nan, nan, nan, nan],
         [nan, nan, nan, nan],
         [nan, nan, nan, nan],
         ...,
         [nan, nan, nan, nan],
         [nan, nan, nan, nan],
         [nan, nan, nan, nan]],

        [[nan, nan, nan, nan],
         [nan, nan, nan, nan],
         [nan, nan, nan, nan],
         ...,
         [nan, nan, nan, nan],
         [nan, nan, nan, nan],
         [nan, nan, nan, nan]],

        [[nan, nan, nan, nan],
         [nan, nan, nan, nan],
         [nan, nan, nan, nan],
         ...,
         [nan, nan, nan, nan],
         [nan, nan, nan, nan],
         [nan, nan, nan, nan]],

        [[nan, nan, nan, nan],
         [nan, nan, nan, nan],
         [nan, nan, nan, nan],
         ...,
         [nan, nan, nan, nan],
         [nan, nan, nan, nan],
         [nan, nan, nan, nan]],

        [[nan, nan, nan, nan],
         [nan, nan, nan, nan],
         [nan, nan, nan, nan],
         ...,
         [nan, nan, nan, nan],
         [nan, nan, nan, nan],
         [nan, nan, nan, nan]],

        [[nan, nan, nan, nan],
         [nan, nan, nan, nan],
         [nan, nan, nan, nan],
         ...,
         [nan, nan, nan, nan],
         [nan, nan, nan, nan],
         [nan, nan, nan, nan]]], device='cuda:0', grad_fn=<SelectBackward0>)

[rank2]: Traceback (most recent call last):
[rank2]:   File "/data2/hh/workspace/DEIM/train.py", line 84, in <module>
[rank2]:     main(args)
[rank2]:   File "/data2/hh/workspace/DEIM/train.py", line 54, in main
[rank2]:     solver.fit()
[rank2]:   File "/data2/hh/workspace/DEIM/engine/solver/det_solver.py", line 76, in fit
[rank2]:     train_stats = train_one_epoch(
[rank2]:                   ^^^^^^^^^^^^^^^^
[rank2]:   File "/data2/hh/workspace/DEIM/engine/solver/det_engine.py", line 65, in train_one_epoch
[rank2]:     loss_dict = criterion(samples, outputs, targets, **metas)
[rank2]:                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/data2/hh/anaconda3/envs/owrt/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank2]:     return self._call_impl(*args, **kwargs)
[rank2]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/data2/hh/anaconda3/envs/owrt/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank2]:     return forward_call(*args, **kwargs)
[rank2]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/data2/hh/workspace/DEIM/engine/deim/deim_criterion.py", line 348, in forward
[rank2]:     indices = self.matcher(outputs_without_aux, targets)['indices']
[rank2]:               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/data2/hh/anaconda3/envs/owrt/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank2]:     return self._call_impl(*args, **kwargs)
[rank2]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/data2/hh/anaconda3/envs/owrt/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank2]:     return forward_call(*args, **kwargs)
[rank2]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/data2/hh/anaconda3/envs/owrt/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank2]:     return func(*args, **kwargs)
[rank2]:            ^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/data2/hh/workspace/DEIM/engine/deim/matcher.py", line 101, in forward
[rank2]:     cost_giou = -generalized_box_iou(box_cxcywh_to_xyxy(out_bbox), box_cxcywh_to_xyxy(tgt_bbox))
[rank2]:                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/data2/hh/workspace/DEIM/engine/deim/box_ops.py", line 53, in generalized_box_iou
[rank2]:     assert (boxes1[:, 2:] >= boxes1[:, :2]).all()
[rank2]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: AssertionError
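
For context, the failing assertion in box_ops.py only checks that every predicted box is well-formed after the cxcywh-to-xyxy conversion (x_max >= x_min and y_max >= y_min); NaN comparisons evaluate to False, so it is really tripping over the NaN predictions printed above, and the actual problem originates before the matcher. Below is a minimal sketch of a fail-fast check that could be called just before the matcher to get a clearer error message; the helper name and its placement are assumptions, not part of the DEIM code base.

import torch

def assert_finite_boxes(pred_boxes: torch.Tensor, where: str = "matcher input") -> None:
    # pred_boxes: predicted boxes in cxcywh format, shape [..., 4]
    if not torch.isfinite(pred_boxes).all():
        bad = (~torch.isfinite(pred_boxes)).any(dim=-1).sum().item()
        raise RuntimeError(
            f"{bad} non-finite predicted boxes at {where}; "
            "the NaNs appear before generalized_box_iou is reached."
        )

# hypothetical usage inside the criterion, right before the matcher call:
# assert_finite_boxes(outputs_without_aux['pred_boxes'])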

Sunburst7 avatar Apr 18 '25 03:04 Sunburst7

I got the same error.

flydragon2018 avatar Apr 21 '25 13:04 flydragon2018

You can probably resolve the issue by setting the number of classes to n+1. For example, if you have 10 classes, set num_classes = 11

“nan" problem in training custom dataset #15

@Sunburst7 @flydragon2018
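
If you try the num_classes = n+1 suggestion, it usually goes into the .yml configuration. This is a hedged illustration only, since the exact file and key layout depend on your DEIM setup:

num_classes: 11   # 10 actual classes in the dataset + 1, per the suggestion above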

kimtaehyeong avatar May 06 '25 15:05 kimtaehyeong

You can probably resolve the issue by setting the number of classes to n+1. For example, if you have 10 classes, set num_classes = 11

“nan" problem in training custom dataset #15

@Sunburst7 @flydragon2018

Yes, I was training on my custom dataset; I will try this solution. Could you give an explanation of this error?

Sunburst7 avatar May 07 '25 12:05 Sunburst7

Actually, I found a suggestion elsewhere that you need to set dtype=float64, but I didn't try it; I just switched to another model.

flydragon2018 avatar May 16 '25 03:05 flydragon2018

I had this issue too, but resolved it by lowering the learning rate and fine-tuning the regularization terms.

Another reply regarding this issue: I experienced a similar problem after a couple of training epochs. Removing --use-amp from the command fixed it.

shblyy avatar May 16 '25 11:05 shblyy

It's definitely not limited to training on a custom dataset. Basically, if you change anything in the model, the error can occur frequently.

altair199797 avatar Jun 26 '25 09:06 altair199797

Hi all, for those still facing this NaN -> AssertionError when training with Automatic Mixed Precision (--use-amp), I believe I've identified the root cause and have a stable fix.

The issue seems to stem from the default eps=1e-8 in the AdamW optimizer. This value is too small for the limited precision of float16, causing the denominator in the optimizer's update step to underflow to zero. This division by zero results in the NaN values that eventually crash the training.

I was able to fix this permanently by setting a slightly larger epsilon (1e-7) in my .yml config file:

optimizer:
  type: AdamW
  # ... other params
  eps: 0.0000001

This prevents the underflow and allows training to proceed stably with AMP. This is a known interaction with Adam/AMP, and you can find more technical details in this PyTorch issue: https://github.com/pytorch/pytorch/issues/26218.
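
For anyone building the optimizer in code rather than through the .yml, the equivalent is simply passing a larger eps to AdamW. A minimal sketch; the model and learning rate here are placeholders, not DEIM's actual values:

import torch

model = torch.nn.Linear(4, 4)  # placeholder for the actual DEIM model

# eps=1e-7 keeps the Adam second-moment denominator representable under
# float16 AMP; the default 1e-8 can underflow to zero and produce NaNs.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-4, eps=1e-7)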

Hope this helps others!

EwertzJN avatar Jul 31 '25 09:07 EwertzJN

I'm still seeing the same error even after using eps=0.0000001. I'm running training on a custom dataset, from scratch. Were you able to run a full training with the suggested fix?

csampat-a avatar Aug 03 '25 01:08 csampat-a

You can also deactivate AMP (automatic mixed precision) during training. If that's not possible because training would then need too much memory, you can deactivate it just for the attention operations, i.e. wrap the attention call in something like with torch.autocast(enabled=False) (a sketch follows below).

It fixed it for me.
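
A minimal sketch of that pattern, assuming a standard scaled dot-product attention call; DEIM's deformable attention lives elsewhere in the code, so treat this purely as an illustration of forcing one operation back to float32 under AMP:

import torch
import torch.nn.functional as F

def attention_fp32(q, k, v):
    # Disabling autocast does not cast the inputs back by itself,
    # so the explicit .float() calls are required.
    with torch.autocast(device_type="cuda", enabled=False):
        out = F.scaled_dot_product_attention(q.float(), k.float(), v.float())
    # Cast back so the surrounding half-precision graph is unaffected.
    return out.to(q.dtype)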

altair199797 avatar Aug 03 '25 09:08 altair199797

@csampat-a, that's interesting. Yes, I was able to run multiple full training runs also on a custom dataset with my suggested fix. Before, I ran into the exact same problem with the failed assertion because of the NaN values. At first, I also tried to deactivate amp like @altair199797 suggests, which also might help, but I wasn't happy with the negative impact on the training time and memory consumption of this possible fix.

EwertzJN avatar Aug 04 '25 08:08 EwertzJN

@csampat-a, that's interesting. Yes, I was able to run multiple full training runs also on a custom dataset with my suggested fix. Before, I ran into the exact same problem with the failed assertion because of the NaN values. At first, I also tried to deactivate amp like @altair199797 suggests, which also might help, but I wasn't happy with the negative impact on the training time and memory consumption of this possible fix.

If you deactivate AMP just for the attention operation, the disadvantages are minimal. It only needs to be off for the SDA; that is the source of the problem.

altair199797 avatar Aug 04 '25 13:08 altair199797

@csampat-a, that's interesting. Yes, I was able to run multiple full training runs on a custom dataset with my suggested fix. Before, I ran into exactly the same problem, with the assertion failing because of NaN values. At first, I also tried disabling AMP as @altair199797 suggests, which might also help, but I wasn't happy with the negative impact this possible fix has on training time and memory consumption.

I was able to train the model successfully on the FLIR dataset. However, when I switched to a multi-modal configuration, the assertion error occurred. While testing with controlled variables, I found that with a ResNet50 backbone the NaN issue does not appear, but with HGnet the problem consistently emerges between epochs 9 and 12 in my modified multi-modal DEIM framework. So I'm fairly sure the problem is that HGnet's outputs are highly unstable.

However, there is a significant difference in the number of parameters between ResNet50 and HGnet-L. I'm still working on finding a solution to this problem.
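
If you want to confirm that the NaNs really do start in the backbone, one option is to attach forward hooks that raise as soon as any submodule emits non-finite values. This is a hypothetical debugging helper, not part of DEIM:

import torch

def register_nan_watch(module: torch.nn.Module, name: str = "backbone") -> None:
    # Raises the moment any submodule produces NaN/Inf, which localizes
    # whether the instability begins in the backbone or further downstream.
    def make_hook(mod_name):
        def hook(mod, inputs, output):
            outs = output if isinstance(output, (tuple, list)) else [output]
            for t in outs:
                if torch.is_tensor(t) and not torch.isfinite(t).all():
                    raise RuntimeError(f"non-finite output from {mod_name}")
        return hook

    for child_name, child in module.named_modules():
        child.register_forward_hook(make_hook(f"{name}.{child_name}"))

# hypothetical usage: register_nan_watch(model.backbone)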

wuruoyu1997coder avatar Nov 24 '25 05:11 wuruoyu1997coder