DiffusionTrack

Error nan value occurs

Open · TomX32 opened this issue 1 year ago · 1 comment

I have run into an issue across multiple training sessions: a memory error can occur even though ample GPU memory is available each time. Here are the details:

- Issue description: a memory error occasionally arises during the training process.
- Training dataset: a custom MOT (Multiple Object Tracking) dataset.
- Dataset details: the dataset contains two classes.
- Environment information: GPU memory is sufficient; the specific memory size and training configuration can be provided.

```
    return forward_call(*args, **kwargs)
           │             │       └ {}
           │             └ ({'pred_logits': tensor([[[-2.1465, -5.6719],
           │                        [-2.2617, -4.8477],
           │                        [-2.4082, -5.9219],
           │                        ...,
           │                      ...
           └ <bound method HungarianMatcherDynamicK.forward of HungarianMatcherDynamicK()>

  File "/rydata/mot/DiffusionTrack/tools/../diffusion/models/diffusion_losses.py", line 387, in forward
    assert not torch.any(torch.isnan(cost)),"Error nan value occurs"
               │     │   │     │     └ tensor([[104.9505, 101.8805, 109.8512],
               │     │   │     │               [111.1139, 107.6783, 112.0708],
               │     │   │     │               [106.6360, 110.6017, 104.6933],
               │     │   │     │            ...
               │     │   │     └ <built-in method isnan of type object at 0x2ad86cf91500>
               │     │   └ <module 'torch' from '/rydata/mot/dftrack_env/lib/python3.8/site-packages/torch/__init__.py'>
               │     └ <built-in method any of type object at 0x2ad86cf91500>
               └ <module 'torch' from '/rydata/mot/dftrack_env/lib/python3.8/site-packages/torch/__init__.py'>

AssertionError: Error nan value occurs
Total memory: 81920 MB, Used memory: 0 MB, Max allocatable memory: 77824 MB
Block memory to allocate: 77824 MB
RuntimeError during memory allocation: CUDA out of memory. Tried to allocate 76.00 GiB (GPU 1; 79.33 GiB total capacity; 1.23 GiB already allocated; 41.31 GiB free; 1.26 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Skipping memory pre-allocation due to insufficient memory.
loading annotations into memory...
Done (t=0.65s)
creating index...
index created!
```
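
The assertion fires inside `HungarianMatcherDynamicK.forward`, so to narrow this down I am checking whether the NaN is already present in the network outputs (e.g. `pred_logits` and the other entries of the outputs dict) before the matching cost is built. A minimal sketch of such a check (the dict key follows the traceback above; the call site is hypothetical and may differ in the actual training loop):

```python
import torch

def check_finite(outputs: dict, step: int) -> None:
    """Report which output tensor contains NaN/Inf before the matcher runs."""
    for name, value in outputs.items():
        if torch.is_tensor(value) and not torch.isfinite(value).all():
            bad = (~torch.isfinite(value)).sum().item()
            print(f"step {step}: {bad} non-finite values in outputs['{name}']")

# Hypothetical call site, e.g. right before the loss/matcher is invoked:
# check_finite(outputs, step=cur_iter)
```

The CUDA out-of-memory message above appears to come from a memory pre-allocation step that is then skipped, so it is probably unrelated to the NaN itself; if fragmentation were the concern, the documented `PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:<MB>` environment variable is the knob that message refers to.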

TomX32 · Feb 14 '25

Do not use half-precision training, as it may result in NaN values.
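
For example, a minimal sketch of keeping mixed precision disabled in a generic PyTorch training loop (the tiny model, data, and optimizer below are toy stand-ins, not DiffusionTrack's code); with `use_fp16 = False`, `autocast` and `GradScaler` become no-ops and everything runs in fp32, which avoids fp16 overflow turning the matching cost into NaN:

```python
import torch
import torch.nn as nn

use_fp16 = False  # keep False: train fully in fp32 to avoid fp16-induced NaN

# Toy stand-ins so the sketch runs on its own; swap in the real model,
# dataloader, and loss from the training script.
model = nn.Linear(8, 2).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler(enabled=use_fp16)

for _ in range(3):
    x = torch.randn(4, 8, device="cuda")
    target = torch.randn(4, 2, device="cuda")
    optimizer.zero_grad()
    with torch.cuda.amp.autocast(enabled=use_fp16):
        pred = model(x)
    # Compute the loss in fp32 even if the forward pass ran under autocast.
    loss = nn.functional.mse_loss(pred.float(), target)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```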

RainBowLuoCS · Apr 26 '25