AssertionError assert (boxes1[:, 2:] >= boxes1[:, :2]).all()
Hello, I randomly hit this bug when training on a custom dataset. Can you give me some ideas on how to fix it 😭😭? Here is the log:
Epoch: [48] [ 0/112] eta: 0:05:41 lr: 0.000013 loss: 66.9744 (66.9744) loss_bbox: 0.1108 (0.1108) loss_bbox_aux_0: 0.1473 (0.1473) loss_bbox_aux_1: 0.1123 (0.1123) loss_bbox_aux_2: 0.1125 (0.1125) loss_bbox_aux_3: 0.1110 (0.1110) loss_bbox_aux_4: 0.1108 (0.1108) loss_bbox_dn_0: 0.1200 (0.1200) loss_bbox_dn_1: 0.0846 (0.0846) loss_bbox_dn_2: 0.0808 (0.0808) loss_bbox_dn_3: 0.0801 (0.0801) loss_bbox_dn_4: 0.0801 (0.0801) loss_bbox_dn_5: 0.0801 (0.0801) loss_bbox_dn_pre: 0.1200 (0.1200) loss_bbox_enc_0: 0.1741 (0.1741) loss_bbox_pre: 0.1453 (0.1453) loss_ddf_aux_0: 1.2560 (1.2560) loss_ddf_aux_1: 0.1161 (0.1161) loss_ddf_aux_2: 0.0161 (0.0161) loss_ddf_aux_3: 0.0022 (0.0022) loss_ddf_aux_4: -0.0001 (-0.0001) loss_ddf_dn_0: 0.8419 (0.8419) loss_ddf_dn_1: 0.0872 (0.0872) loss_ddf_dn_2: 0.0042 (0.0042) loss_ddf_dn_3: 0.0004 (0.0004) loss_ddf_dn_4: 0.0001 (0.0001) loss_fgl: 1.4658 (1.4658) loss_fgl_aux_0: 1.3853 (1.3853) loss_fgl_aux_1: 1.3759 (1.3759) loss_fgl_aux_2: 1.4444 (1.4444) loss_fgl_aux_3: 1.4465 (1.4465) loss_fgl_aux_4: 1.4628 (1.4628) loss_fgl_dn_0: 1.3113 (1.3113) loss_fgl_dn_1: 1.2754 (1.2754) loss_fgl_dn_2: 1.2629 (1.2629) loss_fgl_dn_3: 1.2614 (1.2614) loss_fgl_dn_4: 1.2619 (1.2619) loss_fgl_dn_5: 1.2620 (1.2620) loss_giou: 0.4707 (0.4707) loss_giou_aux_0: 0.5837 (0.5837) loss_giou_aux_1: 0.4715 (0.4715) loss_giou_aux_2: 0.4781 (0.4781) loss_giou_aux_3: 0.4721 (0.4721) loss_giou_aux_4: 0.4707 (0.4707) loss_giou_dn_0: 0.6187 (0.6187) loss_giou_dn_1: 0.4390 (0.4390) loss_giou_dn_2: 0.4201 (0.4201) loss_giou_dn_3: 0.4168 (0.4168) loss_giou_dn_4: 0.4165 (0.4165) loss_giou_dn_5: 0.4165 (0.4165) loss_giou_dn_pre: 0.6132 (0.6132) loss_giou_enc_0: 0.6817 (0.6817) loss_giou_pre: 0.5724 (0.5724) loss_mal: 3.2324 (3.2324) loss_mal_aux_0: 2.3652 (2.3652) loss_mal_aux_1: 3.9062 (3.9062) loss_mal_aux_2: 5.6055 (5.6055) loss_mal_aux_3: 3.3750 (3.3750) loss_mal_aux_4: 3.2148 (3.2148) loss_mal_dn_0: 0.8051 (0.8051) loss_mal_dn_1: 0.8197 (0.8197) loss_mal_dn_2: 0.7374 (0.7374) loss_mal_dn_3: 0.7656 (0.7656) loss_mal_dn_4: 0.7122 (0.7122) loss_mal_dn_5: 0.7305 (0.7305) loss_mal_dn_pre: 0.7861 (0.7861) loss_mal_enc_0: 5.3057 (5.3057) loss_mal_pre: 3.3848 (3.3848) loss_obj_ll: 0.4802 (0.4802) loss_obj_ll_aux_0: 0.6456 (0.6456) loss_obj_ll_aux_1: 0.5525 (0.5525) loss_obj_ll_aux_2: 0.6497 (0.6497) loss_obj_ll_aux_3: 0.6068 (0.6068) loss_obj_ll_aux_4: 0.5420 (0.5420) time: 3.0448 data: 0.8962 max mem: 6982
tensor([[[nan, nan, nan, nan],
[nan, nan, nan, nan],
[nan, nan, nan, nan],
...,
[nan, nan, nan, nan],
[nan, nan, nan, nan],
[nan, nan, nan, nan]],
[[nan, nan, nan, nan],
[nan, nan, nan, nan],
[nan, nan, nan, nan],
...,
[nan, nan, nan, nan],
[nan, nan, nan, nan],
[nan, nan, nan, nan]],
[[nan, nan, nan, nan],
[nan, nan, nan, nan],
[nan, nan, nan, nan],
...,
[nan, nan, nan, nan],
[nan, nan, nan, nan],
[nan, nan, nan, nan]],
[[nan, nan, nan, nan],
[nan, nan, nan, nan],
[nan, nan, nan, nan],
...,
[nan, nan, nan, nan],
[nan, nan, nan, nan],
[nan, nan, nan, nan]],
[[nan, nan, nan, nan],
[nan, nan, nan, nan],
[nan, nan, nan, nan],
...,
[nan, nan, nan, nan],
[nan, nan, nan, nan],
[nan, nan, nan, nan]],
[[nan, nan, nan, nan],
[nan, nan, nan, nan],
[nan, nan, nan, nan],
...,
[nan, nan, nan, nan],
[nan, nan, nan, nan],
[nan, nan, nan, nan]]], device='cuda:0', grad_fn=<SelectBackward0>)
[rank2]: Traceback (most recent call last):
[rank2]: File "/data2/hh/workspace/DEIM/train.py", line 84, in <module>
[rank2]: main(args)
[rank2]: File "/data2/hh/workspace/DEIM/train.py", line 54, in main
[rank2]: solver.fit()
[rank2]: File "/data2/hh/workspace/DEIM/engine/solver/det_solver.py", line 76, in fit
[rank2]: train_stats = train_one_epoch(
[rank2]: ^^^^^^^^^^^^^^^^
[rank2]: File "/data2/hh/workspace/DEIM/engine/solver/det_engine.py", line 65, in train_one_epoch
[rank2]: loss_dict = criterion(samples, outputs, targets, **metas)
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/data2/hh/anaconda3/envs/owrt/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank2]: return self._call_impl(*args, **kwargs)
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/data2/hh/anaconda3/envs/owrt/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank2]: return forward_call(*args, **kwargs)
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/data2/hh/workspace/DEIM/engine/deim/deim_criterion.py", line 348, in forward
[rank2]: indices = self.matcher(outputs_without_aux, targets)['indices']
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/data2/hh/anaconda3/envs/owrt/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank2]: return self._call_impl(*args, **kwargs)
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/data2/hh/anaconda3/envs/owrt/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank2]: return forward_call(*args, **kwargs)
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/data2/hh/anaconda3/envs/owrt/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank2]: return func(*args, **kwargs)
[rank2]: ^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/data2/hh/workspace/DEIM/engine/deim/matcher.py", line 101, in forward
[rank2]: cost_giou = -generalized_box_iou(box_cxcywh_to_xyxy(out_bbox), box_cxcywh_to_xyxy(tgt_bbox))
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/data2/hh/workspace/DEIM/engine/deim/box_ops.py", line 53, in generalized_box_iou
[rank2]: assert (boxes1[:, 2:] >= boxes1[:, :2]).all()
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: AssertionError
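For context: the boxes printed above are all NaN, and NaN fails every comparison, so the box-validity assert in generalized_box_iou is simply the first place the NaNs get caught. A minimal illustration, independent of DEIM:

import torch

# NaN fails every comparison, so an all-NaN box makes this check False,
# even though the assert is really about x2 >= x1 and y2 >= y1.
boxes = torch.tensor([[float("nan"), float("nan"), float("nan"), float("nan")]])
print((boxes[:, 2:] >= boxes[:, :2]).all())  # tensor(False) -> AssertionError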
I got the same error.
@Sunburst7 @flydragon2018
You can probably resolve the issue by setting the number of classes to n+1. For example, if you have 10 classes, set num_classes = 11
“nan" problem in training custom dataset #15
Yes, I was training on my custom dataset. I will try this solution. Could you give an explanation of this error?
Actually, I found a solution somewhere else that says to set dtype=float64, but I didn't try it; I just switched to another model.
I had this issue too, but resolved it by lowering the learning rate and fine-tuning the regularization terms.
Another reply regarding this issue: I have experienced a similar issue after a couple of training epochs. Removing --use-amp from the command fixed it.
It's definitely not only when you train on a custom dataset. Basically, if you change anything in the model, the error can occur frequently.
Hi all, for those still facing this NaN -> AssertionError when training with Automatic Mixed Precision (--use-amp), I believe I've identified the root cause and have a stable fix.
The issue seems to stem from the default eps=1e-8 in the AdamW optimizer. This value is too small for the limited precision of float16, causing the denominator in the optimizer's update step to underflow to zero. This division by zero results in the NaN values that eventually crash the training.
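You can check the precision part of this claim directly in a standalone snippet (not DEIM-specific):

import torch

# float16 cannot represent 1e-8: it underflows to zero, while 1e-7 survives
# (the smallest positive subnormal float16 is roughly 6e-8).
print(torch.tensor(1e-8, dtype=torch.float16))  # tensor(0., dtype=torch.float16)
print(torch.tensor(1e-7, dtype=torch.float16))  # tensor(1.1921e-07, dtype=torch.float16)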
I was able to fix this permanently by setting a slightly larger epsilon (1e-7) in my .yml config file:
optimizer:
type: AdamW
# ... other params
eps: 0.0000001
This prevents the underflow and allows training to proceed stably with AMP. This is a known interaction with Adam/AMP, and you can find more technical details in this PyTorch issue: https://github.com/pytorch/pytorch/issues/26218.
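For reference, here is the equivalent change when building the optimizer directly in PyTorch rather than through the yml config; this is a minimal sketch with placeholder hyperparameters, not DEIM's actual optimizer setup:

import torch
from torch import nn

model = nn.Linear(256, 4)  # placeholder module standing in for the detector

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-4,            # placeholder; use the learning rate from your config
    weight_decay=1e-4,  # placeholder
    eps=1e-7,           # the larger epsilon that avoids the underflow under AMP
)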
Hope this helps others!
I'm still seeing the same error even after using eps=0.0000001. I'm running training on a custom dataset, from scratch. Were you able to run full training with the above suggested fix?
You can also deactivate AMP (automatic mixed precision) during training. If that's not possible because training would then need too much memory, you can deactivate it just for the attention operations:
Something like:
with torch.autocast(device_type="cuda", enabled=False): ...attention operation...
It fixed it for me.
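For anyone who wants to try this, here is a minimal sketch of the idea. The helper below is hypothetical (DEIM's attention modules may be structured differently) and assumes a plain scaled-dot-product attention call; the point is simply to disable autocast locally and cast the inputs back to float32 around the attention op:

import torch
import torch.nn.functional as F

def attention_fp32(q, k, v):
    # Hypothetical helper: run the attention op in float32 even when the
    # surrounding forward pass uses AMP/float16.
    with torch.autocast(device_type="cuda", enabled=False):
        q, k, v = q.float(), k.float(), v.float()
        return F.scaled_dot_product_attention(q, k, v)

If the following layers run in half precision, cast the result back to the original dtype before returning.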
@csampat-a, that's interesting. Yes, I was able to run multiple full training runs also on a custom dataset with my suggested fix. Before, I ran into the exact same problem with the failed assertion because of the NaN values. At first, I also tried to deactivate amp like @altair199797 suggests, which also might help, but I wasn't happy with the negative impact on the training time and memory consumption of this possible fix.
If you just deactivate AMP for the attention operation, the disadvantages are minimal. Just for the SDA; that is the source of the problem.
I was able to successfully train the model on the FLIR dataset. However, when I switched to a multi-modal configuration, the assertion error occurred. Testing with controlled variables, I found that the NaN issue does not appear when the backbone is ResNet50, but with HGnet the problem consistently emerges between epochs 9 and 12 in my modified multi-modal DEIM framework. So I'm fairly sure the problem is that HGnet's outputs are highly unstable.
However, there is a significant difference in the number of parameters between ResNet50 and HGnet-L. I'm still working on finding a solution to this problem.