Resume training uses more memory
Resume training uses more GPU memory than before. With a 1.7B model and TP2, about 1.7 GB of extra memory is used per GPU, which is exactly half of the model weight size. In other words, resume appears to load an extra weights-only copy of the model. Megatron id 345080, NeMo id 70b2dd, nvcr.io/nvidia/nemo:25.07.01.
The customer needs to keep ckpt_async_save set to True to reduce checkpoint saving time.
Do you have any suggestions about this?
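A minimal sketch of the relevant configuration, assuming the NeMo 2.0 lightning API in the 25.07.01 container (everything except TP2 and ckpt_async_save is a placeholder, not the customer's exact script):

```python
# Sketch of the resume setup; API names assume nemo.lightning in nvcr.io/nvidia/nemo:25.07.01.
import nemo.lightning as nl

strategy = nl.MegatronStrategy(
    tensor_model_parallel_size=2,  # TP2, as in the report
    ckpt_async_save=True,          # must stay True to keep checkpoint saving fast
)

trainer = nl.Trainer(
    devices=2,
    strategy=strategy,
    plugins=nl.MegatronMixedPrecision(precision="bf16-mixed"),
)

# Resuming from an existing checkpoint is where the extra ~1.7 GB/GPU shows up,
# compared with the same run started from scratch.
resume = nl.AutoResume(resume_if_exists=True)
```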
Hi, a few questions:
- What is the "before"? Is that a previous release container or some specific commits?
- Resume also loads optimizer states; is that accounted for?
- What exactly are the current behavior and the expected behavior? Can you be more specific?
Thanks for the reply.
- Resume training uses more GPU memory than training from scratch. They tested with the 1.7B model and found that an extra 1.7 GB is used per GPU, which is half of the total model weight size, i.e. exactly one GPU's weight shard under TP2 (see the back-of-envelope calculation after this list). They suspect that resume loads an extra copy of the weights.
- Because the extra usage is measured relative to training from scratch, which also holds optimizer states, the root cause is unrelated to the optimizer state.
- They expect memory usage when resuming training to be the same as memory usage when training from scratch.
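To spell out the arithmetic behind those numbers (assuming bf16/fp16 weights, i.e. 2 bytes per parameter):

```python
# Back-of-envelope check: why an extra 1.7 GB per GPU looks like a duplicated weight shard.
params = 1.7e9       # 1.7B parameters
bytes_per_param = 2  # bf16/fp16 weights (assumption)
tp = 2               # tensor parallel size

total_weight_gb = params * bytes_per_param / 1e9  # ~3.4 GB for the full weights
per_gpu_shard_gb = total_weight_gb / tp           # ~1.7 GB held by each GPU under TP2

print(f"full weights:           {total_weight_gb:.1f} GB")
print(f"per-GPU shard (TP={tp}):  {per_gpu_shard_gb:.1f} GB")

# The observed extra ~1.7 GB per GPU matches one additional copy of the per-GPU
# weight shard (half of the total weights), consistent with resume keeping a
# second, weights-only copy of the model in memory.
```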
Got it. I think internally we observe something similar. We will check on our end.
Hi @yaoyu-33, do you have any insight into this bug? The customer finds that it shows up only occasionally.
Hi @yaoyu-33, do you have more insight on this bug? Thanks.
@sophiayyya Out of curiosity, would the customer be open to using megatron-bridge, considering most engineering bandwidth has moved over there? Example pretrain script: https://github.com/NVIDIA-NeMo/Megatron-Bridge?tab=readme-ov-file#launching-recipes
Closing this due to inactivity.