[BUG] DeepSpeed zero3 uses more GPU memory than zero2
Describe the bug
- DeepSpeed zero3 uses more GPU memory than zero2.
- Ulysses performance problem.
To Reproduce
Steps to reproduce the behavior: use the Ulysses sample script: https://github.com/microsoft/Megatron-DeepSpeed/blob/main/examples_deepspeed/sequence_parallel/ds_pretrain_gpt_1.3B_seq_parallel_32k.sh
Change the model size to 7B and run on 8x A100-80G with gbs=16, mbs=1, dp=2, sp=4.
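For reference, the only intended difference between the zero2 and zero3 runs below is the ZeRO stage in the DeepSpeed config that the script generates. A minimal sketch of the relevant part (the exact field set in the generated config is an assumption, everything else is omitted):

```json
{
  "train_micro_batch_size_per_gpu": 1,
  "zero_optimization": {
    "stage": 3
  }
}
```

Changing `"stage": 3` to `"stage": 2` gives the zero2 runs.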
Measured results:

| Setup | Recompute | GPU memory (GB) | Time (ms) |
|---|---|---|---|
| DeepSpeed zero3, dp2+sp4 | yes | 52.8 | 13000 |
| DeepSpeed zero2, dp2+sp4 | yes | 43.8 | 13194 |
| DeepSpeed zero3, dp2+sp4 | no | 75.2 | 10329 |
| DeepSpeed zero2, dp2+sp4 | no | 62.3 | 10050 |
| Megatron TP, dp2+tp4 | yes | 32.5 | 16128 |
| Megatron TP, dp2+tp4 | no | 45.0 | 12109 |
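The "recompute" rows correspond to activation checkpointing, which the script enables through the `--checkpoint-activations` flag. If the same thing were driven from the DeepSpeed config instead, the block would look roughly like this (a sketch using DeepSpeed's `activation_checkpointing` section; these values are illustrative defaults, not taken from the script):

```json
{
  "activation_checkpointing": {
    "partition_activations": false,
    "cpu_checkpointing": false,
    "contiguous_memory_optimization": false
  }
}
```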
I'm confused by two things:
- Why does zero3 use more GPU memory than zero2?
- Without checkpoint-activations, at the same dp, it seems Ulysses OOMs earlier than Megatron TP. Why is that?
My DeepSpeed zero1/2/3 + offload uses more GPU memory than DDP
@TheDecisionJTree Can you share your exact DeepSpeed config or the ds_pretrain....sh script?
OK

```json
{
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": 0.001,
      "betas": [0.8, 0.999],
      "eps": 1e-08,
      "weight_decay": 3e-07
    }
  },
  "scheduler": {
    "type": "WarmupLR",
    "params": {
      "warmup_min_lr": 0,
      "warmup_max_lr": 0.001,
      "warmup_num_steps": 1000
    }
  },
  "zero_optimization": {
    "stage": 1,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    }
  },
  "gradient_accumulation_steps": 1,
  "gradient_checkpointing": 1,
  "gradient_clipping": 1.0,
  "train_batch_size": 8,
  "train_micro_batch_size_per_gpu": 1,
  "steps_per_print": 1e5
}
```
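For comparison with the DDP baseline mentioned above, the stage-3 variant of that `zero_optimization` block can additionally offload parameters (a sketch; `offload_param` is only valid at stage 3, and these exact settings are an assumption about the reporter's zero3 runs, not a config they shared):

```json
{
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    }
  }
}
```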