OOM when resuming from Checkpoint
We are training a 14B model on a machine with 8 H100s, and the run made it all the way to step 56/200. However, when we try to resume from the checkpoint saved at step 40, we get an OOM error almost immediately, after just one more step at 41. Any ideas why it would error out there when the previous run was able to get past this step?
You could try torch.cuda.empty_cache() after loading the checkpoint.
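A minimal sketch of what I mean, assuming a plain PyTorch resume path (the function name and checkpoint keys below are placeholders, not the trainer's actual code):

```python
import torch


def resume_from_checkpoint(path: str, model: torch.nn.Module,
                           optimizer: torch.optim.Optimizer) -> int:
    """Hypothetical resume helper; the checkpoint layout is an assumption."""
    # Load onto CPU first so the state dict is not duplicated on the GPU.
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    step = ckpt.get("step", 0)

    # Drop the CPU copy and release cached allocator blocks before training resumes.
    del ckpt
    torch.cuda.empty_cache()
    return step
```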
Actually, during training, each time after _save_checkpoint() it just goes on training; it never calls _load_checkpoints(). I added a torch.cuda.empty_cache() as the last line of _save_checkpoint(), but the problem still occurs. Is it because my gpu_utilization = 0.95 is too high?
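Roughly what I tried (a sketch of my wrapper; the state-dict layout here is just illustrative):

```python
import torch


def _save_checkpoint(path: str, model: torch.nn.Module,
                     optimizer: torch.optim.Optimizer, step: int) -> None:
    """Sketch of my checkpoint wrapper; the exact contents are illustrative."""
    state = {
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "step": step,
    }
    torch.save(state, path)

    # The part I added: drop the CPU-side dict and release cached allocator
    # blocks before the training loop continues. It did not fix the OOM on resume.
    del state
    torch.cuda.empty_cache()
```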
same issue. And theoretically it doesn't make any sense...
You should set gpu_utilization to a smaller value before saving the checkpoint; that solved my issue.
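One way to check whether 0.95 leaves enough headroom is to print the free memory on each GPU right before the save (plain PyTorch, nothing framework-specific):

```python
import torch

# Print per-GPU headroom right before saving; if "free" is close to zero with
# gpu_utilization = 0.95, lowering it (e.g. to 0.85) gives the save/resume path room.
for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)  # returns (free_bytes, total_bytes)
    print(f"cuda:{i}: {free / 2**30:.1f} GiB free of {total / 2**30:.1f} GiB")
```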
@SophieZheng998 can you elaborate a bit more on how you resolved this?
same issue
same