
Resuming training uses more memory

Open sophiayyya opened this issue 3 months ago • 6 comments

Resuming training uses more GPU memory than before. With a 1.7B model and TP2, each GPU uses about 1.7G more memory, which is exactly half of the model size. In other words, resuming appears to load an extra copy of the model containing only the weights. Megatron id 345080, NeMo id 70b2dd, container nvcr.io/nvidia/nemo:25.07.01.

gpt_sft.py run_finetune.sh

The customer needs to keep ckpt_async_save set to True to reduce checkpoint saving time.

Do you have any suggestions about this?

sophiayyya avatar Oct 28 '25 16:10 sophiayyya

Hi, a few questions:

  1. What is the "before"? Is it the previous release container, or some earlier commits?
  2. Resuming also loads the optimizer states. Has that been taken into account?
  3. What is the current behavior and what is the expected behavior? Can you be more specific?

yaoyu-33 avatar Oct 28 '25 17:10 yaoyu-33

Thanks for the reply.

  1. Resuming training uses more GPU memory than training from scratch. They tested with a 1.7B model and found that an extra 1.7G is used per GPU. With TP2, 1.7G is half the size of the model weights. They suspect that resuming loads an extra copy of the weights.
  2. Because the extra memory usage is measured relative to training from scratch, the root cause is unrelated to optimizer state.
  3. They expect memory usage when resuming training to be the same as when training from scratch.
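
For reference, the arithmetic behind the "half of the model weights" observation can be sketched as below. This is only a rough estimate and rests on assumptions the thread does not state: weights stored at 2 bytes per parameter (bf16 or fp16), and tensor parallelism splitting the weights evenly across ranks. The helper name `weight_bytes_per_gpu` is hypothetical, not part of NeMo or Megatron.

```python
# Rough estimate of per-GPU weight memory under tensor parallelism.
# Assumptions (not confirmed in the thread): 2 bytes per parameter
# (bf16/fp16) and an even split of all weights across TP ranks.

def weight_bytes_per_gpu(num_params: int, bytes_per_param: int, tp_size: int) -> float:
    """Bytes of model weights held by one GPU under an even TP split."""
    return num_params * bytes_per_param / tp_size


NUM_PARAMS = 1_700_000_000   # 1.7B model from the report
BYTES_PER_PARAM = 2          # assumed bf16/fp16 weights
TP_SIZE = 2                  # TP2 from the report

full_model_gb = NUM_PARAMS * BYTES_PER_PARAM / 1e9
per_gpu_gb = weight_bytes_per_gpu(NUM_PARAMS, BYTES_PER_PARAM, TP_SIZE) / 1e9

print(f"full weights: {full_model_gb:.1f} GB")
print(f"per-GPU shard with TP{TP_SIZE}: {per_gpu_gb:.1f} GB")
```

Under these assumptions, one weight-only shard per GPU comes to about 1.7 GB, which matches the size of the reported extra usage. That is consistent with the suspicion of a duplicated weight copy but does not by itself confirm it.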

sophiayyya avatar Nov 04 '25 16:11 sophiayyya

Got it. I think we have observed something similar internally. We will check on our end.

yaoyu-33 avatar Nov 05 '25 02:11 yaoyu-33

Hi @yaoyu-33, do you have any insight into this bug? The customer finds that it appears only occasionally.

sophiayyya avatar Nov 12 '25 16:11 sophiayyya

Hi @yaoyu-33, do you have any more insight into this bug? Thanks.

sophiayyya avatar Nov 21 '25 05:11 sophiayyya

@sophiayyya Out of curiosity, would the customer be open to using megatron-bridge, considering that most engineering bandwidth has moved over there? Example pretraining script: https://github.com/NVIDIA-NeMo/Megatron-Bridge?tab=readme-ov-file#launching-recipes

terrykong avatar Nov 21 '25 07:11 terrykong

Closing this due to inactivity.

oyilmaz-nvidia avatar Dec 10 '25 18:12 oyilmaz-nvidia