Resume training uses more memory
Resume training uses more GPU memory than before. With a 1.7B model and TP2, about 1.7 GB of extra memory is used per GPU, which is exactly half of the model weight size. In other words, resume appears to load an extra weights-only copy of the model. Megatron id 345080, NeMo id 70b2dd, nvcr.io/nvidia/nemo:25.07.01.
The customer needs to keep ckpt_async_save set to True to reduce checkpoint saving time.
Do you have any suggestions about this?
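A minimal sketch of the relevant configuration, assuming the NeMo 2.0 lightning API in the 25.07.01 container (everything except TP2 and ckpt_async_save is a placeholder, not the customer's exact script):

```python
# Sketch of the resume setup; API names assume nemo.lightning in nvcr.io/nvidia/nemo:25.07.01.
import nemo.lightning as nl

strategy = nl.MegatronStrategy(
    tensor_model_parallel_size=2,  # TP2, as in the report
    ckpt_async_save=True,          # must stay True to keep checkpoint saving fast
)

trainer = nl.Trainer(
    devices=2,
    strategy=strategy,
    plugins=nl.MegatronMixedPrecision(precision="bf16-mixed"),
)

# Resuming from an existing checkpoint is where the extra ~1.7 GB/GPU shows up,
# compared with the same run started from scratch.
resume = nl.AutoResume(resume_if_exists=True)
```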
Hi, a few questions:
- What is the "before"? Is that a previous release container or some specific commits?
- Resume also loads optimizer states; is that accounted for?
- What exactly are the current behavior and the expected behavior? Can you be more specific?
Thanks for the reply.
- Resume training uses more GPU memory than training from scratch. They tested with the 1.7B model and found that an extra 1.7 GB is used per GPU, which is half of the total model weight size, i.e. exactly one GPU's weight shard under TP2 (see the back-of-envelope calculation after this list). They suspect that resume loads an extra copy of the weights.
- Because the extra usage is measured relative to training from scratch, which also holds optimizer states, the root cause is unrelated to the optimizer state.
- They expect memory usage when resuming training to be the same as memory usage when training from scratch.
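To spell out the arithmetic behind those numbers (assuming bf16/fp16 weights, i.e. 2 bytes per parameter):

```python
# Back-of-envelope check: why an extra 1.7 GB per GPU looks like a duplicated weight shard.
params = 1.7e9       # 1.7B parameters
bytes_per_param = 2  # bf16/fp16 weights (assumption)
tp = 2               # tensor parallel size

total_weight_gb = params * bytes_per_param / 1e9  # ~3.4 GB for the full weights
per_gpu_shard_gb = total_weight_gb / tp           # ~1.7 GB held by each GPU under TP2

print(f"full weights:           {total_weight_gb:.1f} GB")
print(f"per-GPU shard (TP={tp}):  {per_gpu_shard_gb:.1f} GB")

# The observed extra ~1.7 GB per GPU matches one additional copy of the per-GPU
# weight shard (half of the total weights), consistent with resume keeping a
# second, weights-only copy of the model in memory.
```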
Got it. I think internally we observe something similar. We will check on our end.
Hi @yaoyu-33, do you have any insight into this bug? The customer finds that it shows up only occasionally.
Hi @yaoyu-33, do you have more insight on this bug? Thanks.
@sophiayyya Out of curiosity, would the customer be open to using megatron-bridge, considering most engineering bandwidth has moved over there? Example pretrain script: https://github.com/NVIDIA-NeMo/Megatron-Bridge?tab=readme-ov-file#launching-recipes
Closing this due to inactivity.