OOM when resuming from Checkpoint
We are training a 14B model on a machine with 8 H100s, and the run made it all the way to step 56/200. However, when we try to resume from the checkpoint saved at step 40, we get an OOM error almost immediately, after just one more step at 41. Any ideas why it would error out there when the previous run was able to get past this step?
You could try torch.cuda.empty_cache() after loading the checkpoint.
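A minimal sketch of what I mean, assuming a plain PyTorch resume path (the function name and checkpoint keys below are placeholders, not the trainer's actual code):

```python
import torch


def resume_from_checkpoint(path: str, model: torch.nn.Module,
                           optimizer: torch.optim.Optimizer) -> int:
    """Hypothetical resume helper; the checkpoint layout is an assumption."""
    # Load onto CPU first so the state dict is not duplicated on the GPU.
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    step = ckpt.get("step", 0)

    # Drop the CPU copy and release cached allocator blocks before training resumes.
    del ckpt
    torch.cuda.empty_cache()
    return step
```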
Actually, during training, each time after _save_checkpoint() it just goes on training; it never calls _load_checkpoints(). I added a torch.cuda.empty_cache() as the last line of _save_checkpoint(), but the problem still occurs. Is it because my gpu_utilization = 0.95 is too high?
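Roughly what I tried (a sketch of my wrapper; the state-dict layout here is just illustrative):

```python
import torch


def _save_checkpoint(path: str, model: torch.nn.Module,
                     optimizer: torch.optim.Optimizer, step: int) -> None:
    """Sketch of my checkpoint wrapper; the exact contents are illustrative."""
    state = {
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "step": step,
    }
    torch.save(state, path)

    # The part I added: drop the CPU-side dict and release cached allocator
    # blocks before the training loop continues. It did not fix the OOM on resume.
    del state
    torch.cuda.empty_cache()
```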
same issue. And theoretically it doesn't make any sense...
You should set gpu_utilization to a smaller value before saving the checkpoint; that solved my issue.
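One way to check whether 0.95 leaves enough headroom is to print the free memory on each GPU right before the save (plain PyTorch, nothing framework-specific):

```python
import torch

# Print per-GPU headroom right before saving; if "free" is close to zero with
# gpu_utilization = 0.95, lowering it (e.g. to 0.85) gives the save/resume path room.
for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)  # returns (free_bytes, total_bytes)
    print(f"cuda:{i}: {free / 2**30:.1f} GiB free of {total / 2**30:.1f} GiB")
```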
@SophieZheng998 can you elaborate a bit more on how you resolved this?
same issue
same