[BUG] Expected GPU memory saving not observed when running DeepSpeedExamples/training/pipeline_parallelism
Describe the bug
Hi, I am running DeepSpeedExamples/training/pipeline_parallelism. When I run the code on 1 V100 with no pipeline, GPU memory usage is approximately 2739 MB. But when I run the code on 2 V100s using PipelineModule, one GPU uses approximately 2187 MB and the other 1425 MB. Shouldn't each GPU be around 1400 MB?
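For context, a minimal sketch of what I mean by the two-stage PipelineModule run, assuming the example's -p flag maps to num_stages. The layer list and config path below are placeholders, not the example's actual model:

import torch.nn as nn
import deepspeed
from deepspeed.pipe import PipelineModule

# launch with: deepspeed --num_gpus=2 this_script.py
deepspeed.init_distributed()

# placeholder layer list standing in for the example's model
layers = [
    nn.Linear(1024, 4096), nn.ReLU(),
    nn.Linear(4096, 4096), nn.ReLU(),
    nn.Linear(4096, 10),
]

net = PipelineModule(layers=layers,
                     loss_fn=nn.CrossEntropyLoss(),
                     num_stages=2,                   # two pipeline stages, one per GPU
                     partition_method='parameters')  # balance stages by parameter count

engine, _, _, _ = deepspeed.initialize(model=net,
                                       model_parameters=net.parameters(),
                                       config='ds_config.json')  # hypothetical path

With partition_method='parameters' the stages are balanced by parameter count, so I would expect roughly equal memory per GPU.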
To Reproduce
One GPU: deepspeed --num_gpus=1 train.py --deepspeed_config=ds_config.json -p 0 --steps=20000
Two GPUs: deepspeed --num_gpus=2 train.py --deepspeed_config=ds_config.json -p 2 --steps=20000

Expected behavior
Each GPU should use around 1400 MB.
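For reference, a hypothetical helper (not part of the example) that could be dropped into train.py to log per-rank peak allocations and cross-check the numbers above:

import torch
import torch.distributed as dist

def log_peak_memory(tag=""):
    # report this rank's peak allocated/reserved GPU memory in MB
    rank = dist.get_rank() if dist.is_initialized() else 0
    alloc_mb = torch.cuda.max_memory_allocated() / (1024 ** 2)
    reserved_mb = torch.cuda.max_memory_reserved() / (1024 ** 2)
    print(f"[rank {rank}] {tag} peak allocated: {alloc_mb:.0f} MB, "
          f"peak reserved: {reserved_mb:.0f} MB")

# e.g. call log_peak_memory("after 100 steps") from the training loop

Note that nvidia-smi also counts the CUDA context and the caching allocator's reserved-but-unused memory, so its figures will be somewhat higher than max_memory_allocated.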
ds_report output
JIT compiled ops requires ninja
ninja .................. [OKAY]
op name ................ installed .. compatible
cpu_adam ............... [YES] ...... [OKAY]
cpu_adagrad ............ [YES] ...... [OKAY]
fused_adam ............. [YES] ...... [OKAY]
fused_lamb ............. [YES] ...... [OKAY]
sparse_attn ............ [YES] ...... [OKAY]
transformer ............ [YES] ...... [OKAY]
stochastic_transformer . [YES] ...... [OKAY]
async_io ............... [YES] ...... [OKAY]
utils .................. [YES] ...... [OKAY]
quantizer .............. [YES] ...... [OKAY]
transformer_inference .. [YES] ...... [OKAY]
spatial_inference ...... [YES] ...... [OKAY]
DeepSpeed general environment info:
torch install path ............... ['/home/miniconda/envs/bl/lib/python3.8/site-packages/torch']
torch version .................... 1.13.1
torch cuda version ............... 11.6
torch hip version ................ None
nvcc version ..................... 11.0
deepspeed install path ........... ['/home/miniconda/envs/bl/lib/python3.8/site-packages/deepspeed']
deepspeed info ................... 0.7.7, unknown, unknown
deepspeed wheel compiled w. ...... torch 1.13, cuda 11.6