Can we run data parallelism together with pipeline parallelism on GPU?
Hi lingvo contributors,
Thanks for the prompt response to my previous ticket about the Docker version.
I want to run GPipe together with data parallelism (DP) on an 8x GPU server. I searched around and found that num_splits_per_client at program.py:80 seems to determine the degree of DP for the TPU trainer:
self.data_parallelism = p.num_splits_per_client
I set it to 2, expecting the job to run on two DP workers, each with 4 GPipe pipeline stages. With 1 GPU per stage, that should use all 8 GPUs (2 DP x 4 PP). However, I observed that only the first 4 GPUs were active while the last 4 GPUs were idle, and the throughput was similar to that of the 4-GPU non-DP baseline. I suspect this parameter is not taking effect. A screenshot of the GPU trace is attached below.
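To make the expected layout concrete, here is a small sketch in plain Python (no lingvo dependency; the replica-to-GPU mapping is my own assumption about how placement would work, not lingvo's actual placement logic):

```python
# Hypothetical sketch of the GPU layout I expected: 2 data-parallel
# replicas, each running a 4-stage GPipe pipeline with one GPU per stage.
# The (replica, stage) -> GPU mapping below is my assumption, not
# lingvo's actual device-placement code.

NUM_DP_REPLICAS = 2      # what I hoped num_splits_per_client would control
NUM_PIPELINE_STAGES = 4  # GPipe stages per replica

def expected_layout(num_replicas, num_stages):
    """Map (replica, stage) -> global GPU id, filling GPUs in order."""
    layout = {}
    for replica in range(num_replicas):
        for stage in range(num_stages):
            layout[(replica, stage)] = replica * num_stages + stage
    return layout

layout = expected_layout(NUM_DP_REPLICAS, NUM_PIPELINE_STAGES)
# All 8 GPUs should be busy; in practice only GPUs 0-3 were active.
print(sorted(layout.values()))  # -> [0, 1, 2, 3, 4, 5, 6, 7]
```
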
What is the correct way to configure DP in lingvo? And if possible, could you provide an example of using DP together with pipeline parallelism, perhaps in the run_distributed.py format from /docker?
Thank you!
GPU trace:
