Can we run data parallelism together with pipeline parallelism on GPU?
Hi lingvo contributors,
Thanks for the prompt response to my previous ticket about the Docker version.
I want to run GPipe together with data parallelism (DP) on an 8x GPU server. I searched around and found that num_splits_per_client at program.py:80 seems to determine the degree of DP for the TPU trainer:
self.data_parallelism = p.num_splits_per_client
I set it to 2, expecting the job to run on two DP workers, each with 4 GPipe pipeline stages. With 1 GPU per stage, that should use all 8 GPUs (2 DP x 4 PP). However, I observed that only the first 4 GPUs were active while the last 4 GPUs were idle, and the throughput was similar to that of the 4-GPU non-DP baseline. I suspect this parameter is not taking effect. A screenshot of the GPU trace is attached below.
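To make the expected layout concrete, here is a small sketch in plain Python (no lingvo dependency; the replica-to-GPU mapping is my own assumption about how placement would work, not lingvo's actual placement logic):

```python
# Hypothetical sketch of the GPU layout I expected: 2 data-parallel
# replicas, each running a 4-stage GPipe pipeline with one GPU per stage.
# The (replica, stage) -> GPU mapping below is my assumption, not
# lingvo's actual device-placement code.

NUM_DP_REPLICAS = 2      # what I hoped num_splits_per_client would control
NUM_PIPELINE_STAGES = 4  # GPipe stages per replica

def expected_layout(num_replicas, num_stages):
    """Map (replica, stage) -> global GPU id, filling GPUs in order."""
    layout = {}
    for replica in range(num_replicas):
        for stage in range(num_stages):
            layout[(replica, stage)] = replica * num_stages + stage
    return layout

layout = expected_layout(NUM_DP_REPLICAS, NUM_PIPELINE_STAGES)
# All 8 GPUs should be busy; in practice only GPUs 0-3 were active.
print(sorted(layout.values()))  # -> [0, 1, 2, 3, 4, 5, 6, 7]
```
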
What is the correct way to configure DP in lingvo? And if possible, could you provide an example of using DP together with pipeline parallelism, perhaps in the run_distributed.py format from /docker?
Thank you!
GPU trace:
