
Question: how to continue training with more or fewer GPUs

amsword opened this issue 4 years ago · 3 comments

If there are N GPUs, the checkpoint consists of N files for the optimizer states, one file per GPU (let me know if my understanding is incorrect). How can I then continue training with more GPUs, say 2N? Is there an easy way to consolidate the optimizer states?

amsword · Jan 24 '22
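For concreteness, a checkpoint directory saved by ZeRO from N=2 data-parallel GPUs typically contains something like the listing below; exact file names vary across DeepSpeed versions:

```
global_step1000/
    mp_rank_00_model_states.pt                  # model weights
    zero_pp_rank_0_mp_rank_00_optim_states.pt   # optimizer shard for rank/GPU 0
    zero_pp_rank_1_mp_rank_00_optim_states.pt   # optimizer shard for rank/GPU 1
```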

Yes, the N optimizer state files correspond to the N GPUs. We are working on a feature to support changing the number of GPUs between training runs. Can you try setting the following parameter to true in your ZeRO config? [screenshot not preserved]

tjruwase · Jan 24 '22
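The screenshot did not survive, so the exact parameter is not confirmed by this thread; based on the ZeRO config options DeepSpeed offered around this time, a plausible guess is `elastic_checkpoint`. A minimal sketch under that assumption (the example model and batch size are placeholders):

```python
import torch
import deepspeed

# NOTE: "elastic_checkpoint" is an educated guess at the parameter shown
# in the lost screenshot above; this thread does not confirm the name.
ds_config = {
    "train_batch_size": 8,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
    "zero_optimization": {
        "stage": 1,
        "elastic_checkpoint": True,
    },
}

model = torch.nn.Linear(10, 10)  # placeholder model
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```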

Are there any resources on how to do this manually?

itsnamgyu · Apr 08 '24

@itsnamgyu, please see the in-development feature called Universal Checkpointing: https://github.com/microsoft/Megatron-DeepSpeed/blob/main/examples_deepspeed/universal_checkpointing/README.md

tjruwase · Apr 15 '24
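For reference, the workflow in that README is roughly: convert the per-rank ZeRO shards into a rank-independent "universal" checkpoint with the `ds_to_universal.py` script that ships with DeepSpeed, then resume training on the new GPU count. A sketch follows; the paths, folder names, and the exact resume flag may differ by version:

```
# 1. Consolidate the per-rank ZeRO shards into a universal checkpoint.
python deepspeed/checkpoint/ds_to_universal.py \
    --input_folder  checkpoints/global_step1000 \
    --output_folder checkpoints/global_step1000_universal

# 2. Relaunch training on the new number of GPUs and point the run at the
#    universal checkpoint (Megatron-DeepSpeed exposes this as the
#    --universal-checkpoint flag).
```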