Pipeline parallel support for multi-node training?
Hello DeepSpeed :)
I am trying to use the PipelineModule to train a pipeline-parallel model on multiple nodes. I am using Slurm as the cluster scheduler, so I initialized the following environment variables from the Slurm configuration as shown below, and observed that the model layers are partitioned correctly and each partition is placed on the right device.
import os
import deepspeed

# Initializing the distributed process group
os.environ['MASTER_ADDR'] = f'{slurm_handler.master_addr}' # host address of the root process
os.environ['MASTER_PORT'] = f'{slurm_handler.master_port}' # free port on that host
os.environ['RANK'] = os.environ['SLURM_PROCID'] # global rank
os.environ['LOCAL_RANK'] = '0' # Slurm assigns one device per process, so each process sees its device as local rank 0
deepspeed.init_distributed(dist_backend=args.backend)
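For completeness, the env:// rendezvous also needs the world size to be set before init_distributed; a minimal sketch of deriving it from Slurm, assuming the standard SLURM_NTASKS variable (one Slurm task per GPU):
os.environ['WORLD_SIZE'] = os.environ['SLURM_NTASKS']  # assumption: total process count = Slurm task count across all nodes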
However, when I call deepspeed.initialize, the processes on the first node hang waiting for the processes on the second node.
net = PipelineModule(layers=model_ds.to_layers(),
loss_fn=model_ds.loss_fn, num_stages=pp_stage)
### Entrypoint for training w/ DeepSpeed
# TODO: Hangs at p2p.init_process_groups (https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/pipe/engine.py)
engine, _, _, _ = deepspeed.initialize(
args=args,
model=net,
model_parameters=[p for p in net.parameters() if p.requires_grad],
optimizer=optimizer_ds)
I suspect it is because of L152 in PipelineEngine, which initializes p2p communication among the group. So I am wondering whether the DeepSpeed pipeline module supports pipeline-parallel training across multiple nodes.
#initialize peer-2-peer communication and allreduce groups
if self.is_pipe_parallel:
p2p.init_process_groups(self.grid)
If it does, please advise me on what I might have overlooked. Thanks!
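For reference, here is a minimal sanity check I can run right before deepspeed.initialize (just a sketch for my setup; it only confirms that every rank has joined the default process group, since p2p.init_process_groups is collective and hangs if any rank never reaches it; pp_stage is the stage count from above):
import torch.distributed as dist

# Sketch: verify the base process group before the pipeline engine builds its p2p groups.
assert dist.is_initialized(), "call deepspeed.init_distributed() first"
world_size = dist.get_world_size()
assert world_size % pp_stage == 0, \
    f"world size {world_size} is not divisible by num_stages={pp_stage}"
dist.barrier()  # if this already hangs, the problem is the base process group, not PipelineEngine
print(f"rank {dist.get_rank()}/{world_size} reached the barrier", flush=True)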
I launched with the deepspeed launcher and a hostfile; maybe this can solve your problem.
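Something like this (a sketch: worker-1/worker-2, the slot counts, and ds_config.json are placeholders, and train.py is assumed to accept the standard --deepspeed_config argument). The launcher reads the hostfile, spawns the processes over ssh, and sets MASTER_ADDR/MASTER_PORT/RANK/LOCAL_RANK/WORLD_SIZE for you:
# hostfile
worker-1 slots=4
worker-2 slots=4

# launch from the first node
deepspeed --hostfile=hostfile train.py --deepspeed_config ds_config.json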
Have you already solved this problem? If so, please let me know your solution. Thank you so much! @gajagajago
@CChBen Sorry, no solution under the DeepSpeed implementation. It seems PP is only naively supported in DeepSpeed, since their main functionality is ZeRO. However, I am currently developing a pipeline-parallel project that includes the feature you want! I will let you know when it is released.
@gajagajago Any update about your project?
@puppet101 Coming up in a few weeks now. I will post the link here soon.