Why is Pipeline parallelism not compatible with ZeRO-2 and ZeRO-3?
Could you explain why Pipeline parallelism is not compatible with ZeRO-2 and ZeRO-3? Are there any design tradeoffs?
As far as I know, it is pretty common to train large models with DataParallel and PipelineParallel together, and with the constraint above, the offload mechanism cannot be enabled because it depends on ZeRO-2/3.
Also, Megatron-DeepSpeed's pretrain_gpt.py uses GPTModelPipe, a subclass of PipelineModule, as the model passed to deepspeed.initialize(), so it's impossible to enable ZeRO-2/3 in the config JSON. Are there any examples that run with ZeRO-2/3?
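For reference, this is roughly the config I'd like to use (standard DeepSpeed config keys; the batch size is just a placeholder). The `stage` and `offload_optimizer` values are exactly what the constraint above rules out when the model is a PipelineModule:

```json
{
  "train_batch_size": 32,
  "fp16": { "enabled": true },
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": { "device": "cpu" }
  }
}
```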
Some relevant information is in https://github.com/microsoft/DeepSpeed/issues/1110
The ZeRO documentation says:

> ZeRO Stage 3: The 16-bit model parameters are partitioned across the processes. ZeRO-3 will automatically collect and partition them during the forward and backward passes.
How is that not pipeline parallelism?
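To make the question concrete, here is a toy sketch of how I understand the two partitioning schemes (plain Python, not DeepSpeed code; all names are mine). ZeRO-3 gives every rank a slice of every layer and all-gathers a layer's full parameters just in time for compute, while pipeline parallelism gives each rank whole layers, but only a contiguous subset of them:

```python
NUM_RANKS = 4
# 4 layers, 8 parameters each (layer i is filled with the value i)
LAYERS = [[float(i)] * 8 for i in range(4)]

def zero3_shards(layers, num_ranks):
    """ZeRO-3: every rank stores a 1/num_ranks slice of EVERY layer."""
    shard = len(layers[0]) // num_ranks
    return [
        [layer[rank * shard:(rank + 1) * shard] for layer in layers]
        for rank in range(num_ranks)
    ]

def allgather(shards, layer_idx):
    """Reconstruct one layer's full parameters from all ranks' slices,
    as ZeRO-3 does on the fly during forward/backward."""
    full = []
    for rank_shards in shards:
        full.extend(rank_shards[layer_idx])
    return full

def pipeline_stages(layers, num_ranks):
    """Pipeline parallelism: each rank (stage) stores WHOLE layers,
    but only a subset of them (here, one layer per stage)."""
    return [[layers[rank]] for rank in range(num_ranks)]
```

So in this picture ZeRO-3 partitions *within* layers across data-parallel ranks, whereas pipeline parallelism partitions *across* layers; is that the distinction, and is it the reason the two cannot be composed?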