question about OOM errors
A couple of lightning-pose users have been reporting new OOM errors since last Thursday. We haven't pushed any updates affecting our DALI usage on our end. Here's what one of my users reports:
So here’s a summary of the errors in order:
Thurs Afternoon: Training an unsupervised model gave me a memory error: “CUDA out of memory.” I tried clearing the cache; it seemed OK at first but eventually gave me the same error. I was happy with the supervised model, so I gave up on fixing this error. For training I pointed it to a folder of two .mp4 videos (one 91.8 MB, the other 162.2 MB), so I'm surprised it ran out of memory. Decreasing the batch size seemed to help, but training never completed successfully.
Thurs Evening: Predicting on a folder of 25 videos (ranging from 91.8 MB to 538.8 MB), I got an error: “Current pipeline object is no longer valid.” I tried re-running predict_new_vids.py but kept getting the same message. I paused the grid session and then restarted it. The session became stuck initializing and was only stopped 3 days later (after I ran out of credits).
Friday Evening: Started a new session, tried to run an unsupervised model, and got the same error: “Current pipeline object is no longer valid.” I ended the session and never tried to resume it.
Hi @danbider,
DALI 1.14 was released a week ago, so it could be the cause. Can you tell us exactly how we can reproduce the problem?
I also see that a new version of pytorch-lightning has been released, so that is a second thing that could be causing problems.
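If the new DALI release is the suspect, one quick way to test is to pin DALI back to the previous release and re-run. This is a sketch, not a verified fix: the wheel name depends on your CUDA version (`nvidia-dali-cuda110` here is an example for CUDA 11.x), and `1.13.0` is assumed to be the release preceding 1.14.

```shell
# Roll DALI back to the release before 1.14 to check whether the errors
# track the new version. Adjust the wheel name to match your CUDA toolkit
# (e.g. nvidia-dali-cuda110 for CUDA 11.x).
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist \
    nvidia-dali-cuda110==1.13.0
```

If the OOM and "pipeline no longer valid" errors disappear after the downgrade, that would strongly suggest a regression in 1.14 rather than a change on the lightning-pose side.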
@danbider Thanks for reporting the issue. Please provide details on how to reproduce the issue:
- hardware configuration (number and type of GPUs)
- training scripts and command lines
- dataset (it can be a toy dataset or you can point us to a publicly available one)
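To make that report easier to assemble, here is a small sketch that prints the versions of the libraries relevant to this thread. The module names are the standard import paths; anything not installed is reported rather than raising, so the snippet runs in any environment.

```python
import importlib

# Collect versions of the libraries relevant to this report; modules that
# are not installed are recorded as "not installed" instead of raising.
versions = {}
for mod in ["torch", "pytorch_lightning", "nvidia.dali"]:
    try:
        m = importlib.import_module(mod)
        versions[mod] = getattr(m, "__version__", "unknown")
    except ImportError:
        versions[mod] = "not installed"

for name, ver in versions.items():
    print(name, ver)
```

Pasting this output into the issue, along with the `nvidia-smi` hardware summary, should cover the environment side of the reproduction details.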