[BUG] DeepSpeed zero3 uses more GPU memory than zero2
Describe the bug
- DeepSpeed zero3 uses more GPU memory than zero2.
- Ulysses performance problem.
To Reproduce
Steps to reproduce the behavior: use the Ulysses sample script: https://github.com/microsoft/Megatron-DeepSpeed/blob/main/examples_deepspeed/sequence_parallel/ds_pretrain_gpt_1.3B_seq_parallel_32k.sh
Change the model size to 7B and run on 8x A100-80G with gbs=16, mbs=1, dp=2, sp=4.
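For reference, the only intended difference between the zero2 and zero3 runs below is the ZeRO stage in the DeepSpeed config that the script generates. A minimal sketch of the relevant part (the exact field set in the generated config is an assumption, everything else is omitted):

```json
{
  "train_micro_batch_size_per_gpu": 1,
  "zero_optimization": {
    "stage": 3
  }
}
```

Changing `"stage": 3` to `"stage": 2` gives the zero2 runs.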
Measured results:

| Setup | Recompute | GPU memory (GB) | Time (ms) |
|---|---|---|---|
| DeepSpeed zero3, dp2+sp4 | yes | 52.8 | 13000 |
| DeepSpeed zero2, dp2+sp4 | yes | 43.8 | 13194 |
| DeepSpeed zero3, dp2+sp4 | no | 75.2 | 10329 |
| DeepSpeed zero2, dp2+sp4 | no | 62.3 | 10050 |
| Megatron TP, dp2+tp4 | yes | 32.5 | 16128 |
| Megatron TP, dp2+tp4 | no | 45.0 | 12109 |
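The "recompute" rows correspond to activation checkpointing, which the script enables through the `--checkpoint-activations` flag. If the same thing were driven from the DeepSpeed config instead, the block would look roughly like this (a sketch using DeepSpeed's `activation_checkpointing` section; these values are illustrative defaults, not taken from the script):

```json
{
  "activation_checkpointing": {
    "partition_activations": false,
    "cpu_checkpointing": false,
    "contiguous_memory_optimization": false
  }
}
```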
I'm confused by two things:
- Why does zero3 use more GPU memory than zero2?
- Without checkpoint-activations, at the same dp, it seems Ulysses OOMs earlier than Megatron TP. Why is that?
My DeepSpeed zero1/2/3 + offload uses more GPU memory than DDP
@TheDecisionJTree Can you share your exact DeepSpeed config or the ds_pretrain....sh script?
OK

```json
{
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": 0.001,
      "betas": [0.8, 0.999],
      "eps": 1e-08,
      "weight_decay": 3e-07
    }
  },
  "scheduler": {
    "type": "WarmupLR",
    "params": {
      "warmup_min_lr": 0,
      "warmup_max_lr": 0.001,
      "warmup_num_steps": 1000
    }
  },
  "zero_optimization": {
    "stage": 1,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    }
  },
  "gradient_accumulation_steps": 1,
  "gradient_checkpointing": 1,
  "gradient_clipping": 1.0,
  "train_batch_size": 8,
  "train_micro_batch_size_per_gpu": 1,
  "steps_per_print": 1e5
}
```
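For comparison with the DDP baseline mentioned above, the stage-3 variant of that `zero_optimization` block can additionally offload parameters (a sketch; `offload_param` is only valid at stage 3, and these exact settings are an assumption about the reporter's zero3 runs, not a config they shared):

```json
{
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    }
  }
}
```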