Why is Pipeline parallelism not compatible with ZeRO-2 and ZeRO-3?
Could you explain why Pipeline parallelism is not compatible with ZeRO-2 and ZeRO-3? Are there any design tradeoffs?
As far as I know, it is pretty common to train large models with DataParallel and PipelineParallel together, and with the constraint above, the offload mechanism cannot be enabled because it depends on ZeRO-2/3.
Also, Megatron-DeepSpeed's pretrain_gpt.py uses GPTModelPipe, a subclass of PipelineModule, as the model passed to deepspeed.initialize(), so it's impossible to enable ZeRO-2/3 in the config JSON. Are there any examples that run with ZeRO-2/3?
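For reference, this is roughly the config I'd like to use (standard DeepSpeed config keys; the batch size is just a placeholder). The `stage` and `offload_optimizer` values are exactly what the constraint above rules out when the model is a PipelineModule:

```json
{
  "train_batch_size": 32,
  "fp16": { "enabled": true },
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": { "device": "cpu" }
  }
}
```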
Some relevant information is in https://github.com/microsoft/DeepSpeed/issues/1110
The ZeRO documentation says:

> ZeRO Stage 3: The 16-bit model parameters are partitioned across the processes. ZeRO-3 will automatically collect and partition them during the forward and backward passes.
How is that not pipeline parallelism?
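To make the question concrete, here is a toy sketch of how I understand the two partitioning schemes (plain Python, not DeepSpeed code; all names are mine). ZeRO-3 gives every rank a slice of every layer and all-gathers a layer's full parameters just in time for compute, while pipeline parallelism gives each rank whole layers, but only a contiguous subset of them:

```python
NUM_RANKS = 4
# 4 layers, 8 parameters each (layer i is filled with the value i)
LAYERS = [[float(i)] * 8 for i in range(4)]

def zero3_shards(layers, num_ranks):
    """ZeRO-3: every rank stores a 1/num_ranks slice of EVERY layer."""
    shard = len(layers[0]) // num_ranks
    return [
        [layer[rank * shard:(rank + 1) * shard] for layer in layers]
        for rank in range(num_ranks)
    ]

def allgather(shards, layer_idx):
    """Reconstruct one layer's full parameters from all ranks' slices,
    as ZeRO-3 does on the fly during forward/backward."""
    full = []
    for rank_shards in shards:
        full.extend(rank_shards[layer_idx])
    return full

def pipeline_stages(layers, num_ranks):
    """Pipeline parallelism: each rank (stage) stores WHOLE layers,
    but only a subset of them (here, one layer per stage)."""
    return [[layers[rank]] for rank in range(num_ranks)]
```

So in this picture ZeRO-3 partitions *within* layers across data-parallel ranks, whereas pipeline parallelism partitions *across* layers; is that the distinction, and is it the reason the two cannot be composed?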