Checkpoints saved at unexpected steps
A customer resumes from the checkpoint at step 2048 and saves a checkpoint every 256 steps. The saved checkpoints should be at steps 2048, 2304, 2560, 2816, ... but the checkpoints actually saved are at steps 2048, 2304, 2496, 2560, 2624, 2816. Please check the attachment for details. Why are there unexpected saved checkpoints? And why is there an "unfinished" prefix?
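For reference, a minimal sketch of the arithmetic behind the expected schedule, using only the numbers reported above. The function names are hypothetical helpers for illustration, not part of NeMo.

```python
def expected_checkpoint_steps(resume_step, interval, last_step):
    """Steps at which a checkpoint is expected, from resume_step up to last_step."""
    return list(range(resume_step, last_step + 1, interval))


def unexpected_steps(observed, resume_step, interval):
    """Observed steps that do not fall on the resume_step + k * interval grid."""
    return [s for s in observed if (s - resume_step) % interval != 0]


# Values taken from the report: resume at 2048, save every 256 steps.
observed = [2048, 2304, 2496, 2560, 2624, 2816]

expected = expected_checkpoint_steps(2048, 256, max(observed))
extra = unexpected_steps(observed, 2048, 256)

print(expected)  # [2048, 2304, 2560, 2816]
print(extra)     # [2496, 2624]
```

The two off-schedule steps, 2496 and 2624, each sit 64 steps away from the expected checkpoint at 2560.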
@sophiayyya This looks unexpected. The -unfinished suffix should be cleaned up after the checkpoint is saved.
https://github.com/NVIDIA-NeMo/NeMo/blob/e0f7ca9ffe5fe9ea0cfb5fe7ce33eeb34e7bb189/nemo/lightning/pytorch/callbacks/model_checkpoint.py#L584-L609
Is the job given enough time to save the checkpoint? Do you have a minimal reproducer?
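To help answer the question above, one rough way to check whether a save was interrupted is to list entries in the checkpoint directory whose names still contain `unfinished`. This is only a sketch: it assumes the marker appears in the file or directory name, and the file names in the demo are made up for illustration. The exact naming is defined in the linked `model_checkpoint.py`.

```python
import tempfile
from pathlib import Path


def find_unfinished(checkpoint_dir):
    """Return sorted names of entries in checkpoint_dir containing 'unfinished'."""
    root = Path(checkpoint_dir)
    return sorted(p.name for p in root.iterdir() if "unfinished" in p.name)


# Demo on a temporary directory with hypothetical file names.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("step=2304.ckpt", "step=2496-unfinished", "step=2560.ckpt"):
        Path(tmp, name).touch()
    leftovers = find_unfinished(tmp)

print(leftovers)  # ['step=2496-unfinished']
```

Running `find_unfinished` on the real checkpoint directory after the job exits would show whether any marker was left behind.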
Closing this due to inactivity.