
Pipeline parallel support for multi-node training?


Hello DeepSpeed :)

I am trying to use the PipelineModule to train a pipeline-parallel model on multiple nodes. I am using Slurm as the cluster scheduler, so I initialized the following environment variables from the Slurm configuration as shown below, and observed that the model layers are partitioned well and each partition is placed on the correct device.

import os
import deepspeed

# Initialize the distributed process group from the Slurm configuration

os.environ['MASTER_ADDR'] = f'{slurm_handler.master_addr}'  # host address of the root process
os.environ['MASTER_PORT'] = f'{slurm_handler.master_port}'  # free port on the above host
os.environ['RANK'] = os.environ['SLURM_PROCID']             # global rank
os.environ['WORLD_SIZE'] = os.environ['SLURM_NTASKS']       # total number of processes
os.environ['LOCAL_RANK'] = '0'  # Slurm assigns one device per process, so each process sees its device as local rank 0

deepspeed.init_distributed(dist_backend=args.backend)
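
A minimal sanity check at this point (a sketch, assuming torch.distributed has been initialized by the deepspeed.init_distributed call above) is to confirm that every rank on both nodes has joined the default process group:

import torch.distributed as dist

# Every rank should print its global rank out of the full world size and then
# pass the barrier; a hang already here would mean the rendezvous itself
# failed, rather than the pipeline p2p setup later on.
print(f'rank {dist.get_rank()} / world size {dist.get_world_size()}', flush=True)
dist.barrier()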

However, when I call deepspeed.initialize, the processes on the first node hang waiting for the processes on the second node.

net = PipelineModule(layers=model_ds.to_layers(),
                     loss_fn=model_ds.loss_fn, num_stages=pp_stage)

### Entrypoint for training w/ DeepSpeed
# TODO: Hangs at p2p.init_process_groups (https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/pipe/engine.py)
engine, _, _, _ = deepspeed.initialize(
    args=args,
    model=net,
    model_parameters=[p for p in net.parameters() if p.requires_grad],
    optimizer=optimizer_ds)

I suspect it is because of L152 in PipelineEngine, which initializes p2p communication among the group. So I am wondering whether the DeepSpeed pipeline module supports pipeline-parallel training across multiple nodes.

#initialize peer-2-peer communication and allreduce groups
if self.is_pipe_parallel:
    p2p.init_process_groups(self.grid)
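
For what it's worth, here is a sketch of making the process grid explicit instead of relying on num_stages alone, which might help rule out a bad automatic rank-to-stage mapping (PipeDataParallelTopology comes from deepspeed.runtime.pipe.topology; world_size below is a placeholder for the total number of processes):

from deepspeed.pipe import PipelineModule
from deepspeed.runtime.pipe.topology import PipeDataParallelTopology

# Describe the grid explicitly: pp_stage pipeline stages times num_dp
# data-parallel replicas must equal the world size across all nodes.
topo = PipeDataParallelTopology(num_pp=pp_stage, num_dp=world_size // pp_stage)

net = PipelineModule(layers=model_ds.to_layers(),
                     loss_fn=model_ds.loss_fn,
                     topology=topo)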

If it does, please advise me on what I might have overlooked. Thanks!

gajagajago commented on Feb 17 '23

I used the deepspeed launcher with a hostfile; maybe this can solve your problem.
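
Something like the following (a sketch; the hostnames, slot counts, train.py, and ds_config.json are placeholders for your own cluster and scripts):

# hostfile: one line per node, with the number of GPUs (slots) on that node
node1 slots=1
node2 slots=1

# Launch from the first node. The deepspeed launcher starts the worker
# processes on every listed host (over passwordless SSH) and sets
# RANK/WORLD_SIZE/MASTER_ADDR itself, so no manual env setup is needed.
deepspeed --hostfile=hostfile train.py --deepspeed --deepspeed_config ds_config.json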

sharlec commented on Apr 28 '23

Have you solved this problem yet? If so, please let me know your solution. Thank you so much! @gajagajago

BastianChen commented on Oct 30 '23

@CChBen Sorry, no solution under the DeepSpeed implementation. It seems PP is only naively supported in DeepSpeed, since their main functionality is ZeRO. However, I am currently developing a pipeline-parallel project that includes the feature you want! I will let you know when it is released.

gajagajago commented on Nov 01 '23

@gajagajago Any update on your project?

puppet101 commented on Mar 14 '24

@puppet101 Coming up in a few weeks now. I will post the link here soon.

gajagajago commented on Mar 14 '24