Unexpected step of saving checkpoint

Open sophiayyya opened this issue 2 months ago • 1 comments

The customer resumes training from the checkpoint at step 2048 and saves a checkpoint every 256 steps. The saved checkpoints should be at steps 2048, 2304, 2560, 2816, ... But the saved checkpoints are at steps 2048, 2304, 2496, 2560, 2624, 2816. Please check the attached image for details. Why are there unexpected saved checkpoints? And why is there an "unfinished" prefix?
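To make the mismatch concrete, here is a minimal sketch that compares the observed steps against the expected schedule. It assumes checkpoints are expected exactly at the resume step plus multiples of the save interval; the helper names `expected_steps` and `unexpected_steps` are hypothetical and are not part of NeMo.

```python
def expected_steps(start, interval, last):
    """Steps at which a checkpoint is expected, from `start` up to `last` inclusive."""
    return list(range(start, last + 1, interval))


def unexpected_steps(observed, start, interval):
    """Observed steps that do not fall on the start + k * interval schedule."""
    return [s for s in observed if s < start or (s - start) % interval != 0]


# Values taken from the report above.
observed = [2048, 2304, 2496, 2560, 2624, 2816]

print(expected_steps(2048, 256, 2816))        # [2048, 2304, 2560, 2816]
print(unexpected_steps(observed, 2048, 256))  # [2496, 2624]
```

Under that assumption, the two off-schedule checkpoints are steps 2496 and 2624.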

Image

sophiayyya avatar Nov 19 '25 14:11 sophiayyya

@sophiayyya This looks unexpected. The -unfinished suffix should be cleaned up after the checkpoint is saved.

https://github.com/NVIDIA-NeMo/NeMo/blob/e0f7ca9ffe5fe9ea0cfb5fe7ce33eeb34e7bb189/nemo/lightning/pytorch/callbacks/model_checkpoint.py#L584-L609

Is the job given enough time to save the checkpoint? Do you have a minimal reproducer?

terrykong avatar Nov 21 '25 06:11 terrykong

Closing this due to inactivity.

oyilmaz-nvidia avatar Dec 10 '25 18:12 oyilmaz-nvidia