
OOM when resuming from Checkpoint

Open dannnnthemannnn opened this issue 1 year ago • 7 comments

We are training a 14B model on a machine with 8 H100s, and the run made it all the way to step 56/200. However, when we try to resume from the checkpoint saved at step 40, we get an OOM error very quickly, after just one more step at step 41. Any ideas why it would error out there when the previous run was able to get past this step?

dannnnthemannnn avatar Mar 30 '25 20:03 dannnnthemannnn

You could try torch.cuda.empty_cache() after loading the checkpoint.
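
Something along these lines (a rough sketch; `load_checkpoint` here stands in for whatever resume hook you use, not a specific verl API):

```python
import torch


def resume_from_checkpoint(trainer, ckpt_path):
    # Hypothetical resume hook: restore model / optimizer / dataloader state.
    trainer.load_checkpoint(ckpt_path)

    # Release cached allocator blocks left over from loading the checkpoint
    # (full-precision shards, temporary host-to-device copies), so the rollout
    # engine can claim its usual share of GPU memory afterwards.
    torch.cuda.empty_cache()
```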

wplf avatar Apr 01 '25 09:04 wplf

Actually, during training, each time after _save_checkpoint() it just goes on training; it does not call _load_checkpoints(). I added a torch.cuda.empty_cache() at the last line of _save_checkpoint(), but the problem still occurs. Is it because my gpu_utilization = 0.95 is too high?
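
For context, this is roughly where I added the call (a simplified sketch, not the real verl trainer code; the worker-group call is a stand-in):

```python
import torch
import torch.distributed as dist


class TrainerSketch:
    """Illustrative stand-in for the trainer class; not the actual verl implementation."""

    def __init__(self, worker_group, checkpoint_dir):
        self.actor_rollout_wg = worker_group   # worker group that owns model state
        self.checkpoint_dir = checkpoint_dir

    def _save_checkpoint(self):
        # Dump model / optimizer state (the real method saves sharded FSDP
        # states per worker; this single call is a simplification).
        self.actor_rollout_wg.save_checkpoint(self.checkpoint_dir)

        # Wait for all ranks to finish writing before touching GPU memory.
        if dist.is_initialized():
            dist.barrier()

        # The added line: release cached allocator blocks created while
        # materializing full state dicts during the save.
        torch.cuda.empty_cache()
```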

SophieZheng998 avatar Apr 08 '25 02:04 SophieZheng998

Same issue. And theoretically it doesn't make any sense...

KawaiiNotHawaii avatar Jun 23 '25 06:06 KawaiiNotHawaii

You should set gpu_utilization smaller before saving the checkpoint; that solved my issue.
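
What I mean, as a rough sketch (assuming the setting is the rollout engine's gpu_memory_utilization fraction in the verl config; the file name and key path here are illustrative):

```python
from omegaconf import OmegaConf

# Illustrative only: the file name and key path depend on your config layout.
cfg = OmegaConf.load("ppo_trainer.yaml")

# Give the vLLM rollout engine a smaller share of GPU memory (e.g. 0.6
# instead of 0.95) so checkpoint saving has headroom for full state dicts.
cfg.actor_rollout_ref.rollout.gpu_memory_utilization = 0.6

OmegaConf.save(cfg, "ppo_trainer_override.yaml")
```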

SophieZheng998 avatar Jun 23 '25 07:06 SophieZheng998

@SophieZheng998 can you elaborate a bit more on how you resolved this?

sinamoeini avatar Jul 01 '25 19:07 sinamoeini

same issue

YankaiChen0308 avatar Aug 05 '25 03:08 YankaiChen0308

same

VegetaPn avatar Nov 18 '25 08:11 VegetaPn