
question about OOM errors

Open danbider opened this issue 3 years ago • 3 comments

A couple of lightning-pose users have been reporting new OOM errors since last Thursday. We haven't pushed any updates affecting DALI on our end. Here's what one of my users reports:

So here’s a summary of the errors in order:

Thurs Afternoon: Training the unsupervised model gave me a memory error: “CUDA out of memory.” I tried clearing the cache, which seemed OK at first but eventually gave me the same error. I was happy with the supervised model, so I gave up on fixing this error. For training I pointed it to a folder of 2 .mp4 videos (one 91.8 MB and the other 162.2 MB), so I’m surprised that it gave me this memory error. Decreasing the batch size seemed to help, but training never completed successfully.

Thurs Evening: Predicting a folder of 25 videos (range of 91.8 MB to 538.8 MB), I got an error: “Current pipeline object is no longer valid.” I tried re-running predict_new_vids.py but kept on getting the same message. I paused the grid session and then re-started. The grid session became stuck initializing and was only stopped 3 days later (after I ran out of credits).

Friday Evening: Started a new session and tried to run an unsupervised model and I got an error: “Current pipeline object is no longer valid.” Ended the session and I never tried to resume that session.

danbider avatar Jun 06 '22 18:06 danbider

Hi @danbider,

DALI 1.14 was released a week ago, so that could be the cause. Can you tell us exactly how we can reproduce the problem?

JanuszL avatar Jun 07 '22 07:06 JanuszL

I see that a new version of pytorch-lightning has also been released, so that is a second possible cause.

JanuszL avatar Jun 07 '22 08:06 JanuszL
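
Since both a DALI release and a pytorch-lightning release are suspected, it may help to record which versions are actually installed in the failing environment. The following is a minimal sketch using only the Python standard library; the function name `installed_versions` and the prefix list are illustrative assumptions, not part of DALI or lightning-pose. It matches on name prefixes because the DALI wheel name varies with the CUDA version.

```python
from importlib import metadata

# Distribution-name prefixes to report. This list is an assumption made for
# illustration; "torch" will also match related packages such as torchvision.
PREFIXES = ("nvidia-dali", "pytorch-lightning", "torch")


def installed_versions(prefixes=PREFIXES):
    """Return {distribution name: version} for installed packages
    whose normalized name starts with any of the given prefixes (a tuple)."""
    found = {}
    for dist in metadata.distributions():
        raw_name = dist.metadata.get("Name") or ""
        name = raw_name.lower().replace("_", "-")
        if name and name.startswith(prefixes):
            found[name] = dist.version
    return dict(sorted(found.items()))


if __name__ == "__main__":
    # Print in a pip-style format that can be pasted into the issue.
    for name, version in installed_versions().items():
        print(f"{name}=={version}")
```

Comparing this output between a working and a failing session would show whether either package changed.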

@danbider Thanks for reporting the issue. Please provide details on how to reproduce the issue:

  • hardware configuration (number and type of GPUs)
  • training scripts and command lines
  • dataset (it can be a toy dataset or you can point us to a publicly available one)

mzient avatar Jun 07 '22 08:06 mzient
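
For the hardware configuration requested above, one possible way to collect the number and type of GPUs is to query `nvidia-smi`. This is a hedged sketch, not an official DALI diagnostic: the helper name `gpu_summary` is hypothetical, and it assumes `nvidia-smi` is on the `PATH` (it returns an empty list otherwise).

```python
import shutil
import subprocess


def gpu_summary():
    """Return one CSV line per GPU (name, total memory, driver version),
    or an empty list if nvidia-smi is missing or fails."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return []
    try:
        completed = subprocess.run(
            [
                exe,
                "--query-gpu=name,memory.total,driver_version",
                "--format=csv,noheader",
            ],
            capture_output=True,
            text=True,
            check=True,
            timeout=30,
        )
    except (subprocess.SubprocessError, OSError):
        return []
    return [line.strip() for line in completed.stdout.splitlines() if line.strip()]


if __name__ == "__main__":
    gpus = gpu_summary()
    print(f"GPUs found: {len(gpus)}")
    for line in gpus:
        print(line)
```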