
Support tensor parallel/pipeline parallel

Open gongel opened this issue 2 years ago • 4 comments

Do you currently support tensor parallel/pipeline parallel?

gongel avatar Aug 25 '23 11:08 gongel

Can you please share more details?

karan6181 avatar Aug 29 '23 14:08 karan6181

The NVIDIA Megatron team proposed "Tensor Parallelism". When training with tensor parallelism, the ranks in the same tensor-parallel group receive the same data.

Paper: https://arxiv.org/pdf/2205.05198.pdf
Repo: https://github.com/NVIDIA/Megatron-LM

But streaming currently only supports DDP/FSDP.
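To make the requirement concrete, here is a hypothetical sketch (not streaming's API) of how global ranks map to contiguous tensor-parallel groups; every rank in a group must read the same batch, so only the data-parallel index should drive data sharding. The group layout and sizes below are assumptions for illustration:

```python
# Hypothetical sketch: how global ranks map to contiguous tensor-parallel
# groups, and why every rank in a group must receive the same data batch.
import os

world_size = int(os.environ.get("WORLD_SIZE", "8"))  # e.g. 8 GPUs
rank = int(os.environ.get("RANK", "0"))
tensor_parallel_size = 2  # assumed TP degree

# With contiguous groups, ranks [0, 1], [2, 3], ... each form one TP group.
tp_group = rank // tensor_parallel_size
tp_rank = rank % tensor_parallel_size

# Only the data-parallel index should decide which samples a rank reads,
# so both ranks of a TP group end up with identical batches.
data_parallel_rank = tp_group
data_parallel_world_size = world_size // tensor_parallel_size

print(f"global rank {rank}: tp group {tp_group} (tp rank {tp_rank}), "
      f"dp rank {data_parallel_rank}/{data_parallel_world_size}")
```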

gongel avatar Aug 30 '23 11:08 gongel

Any plan to add this?

One simple workaround, which does not seem to work, would be:

os.environ["WORLD_SIZE"] = str(os.environ["WORLD_SIZE"]  // model_parallel_size)
os.environ["RANK"] = str(os.environ["RANK"] // model_parallel_size)

I tried it, but the code seems to get stuck after calling something like:

batch = next(batch_iterator)

where batch_iterator is a dataloader.
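For context, a minimal sketch of the attempted workaround wired into a full data path, assuming a tensor-parallel degree of 2 and placeholder paths and batch sizes; as described above, this is not a working solution:

```python
# Hedged sketch of the workaround described above; model_parallel_size,
# paths, and batch sizes are assumptions for illustration.
import os
from torch.utils.data import DataLoader
from streaming import StreamingDataset

model_parallel_size = 2  # assumed tensor-parallel degree

# Pretend the job is pure data-parallel by shrinking the launcher-provided
# WORLD_SIZE/RANK before streaming reads them.
os.environ["WORLD_SIZE"] = str(int(os.environ.get("WORLD_SIZE", "8")) // model_parallel_size)
os.environ["RANK"] = str(int(os.environ.get("RANK", "0")) // model_parallel_size)

dataset = StreamingDataset(local="/tmp/my-dataset", batch_size=8)  # placeholder local path
dataloader = DataLoader(dataset, batch_size=8)

batch_iterator = iter(dataloader)
batch = next(batch_iterator)  # reportedly hangs here under tensor parallelism
```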

cc: @karan6181

andreamad8 avatar Feb 09 '24 03:02 andreamad8

@snarayan21 Looks like this is being addressed. Is that right?

karan6181 avatar Feb 28 '24 16:02 karan6181

I would like to know if there is any example of a Megatron integration.

huxuan avatar Jun 12 '24 12:06 huxuan

@andreamad8 @huxuan @gongel please see the replication argument detailed in our docs here.

@huxuan We don't have an explicit example of a Megatron integration, but since it's PyTorch-based, you can simply swap in the Streaming dataset / dataloader.
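A minimal sketch of what that might look like, assuming the replication argument described in the docs and placeholder remote/local paths and batch sizes; ranks within the same tensor-parallel group then receive identical samples:

```python
# Minimal sketch, assuming the replication argument; paths, batch sizes,
# and the TP degree below are placeholders.
from torch.utils.data import DataLoader
from streaming import StreamingDataset

tensor_parallel_size = 2  # assumed TP degree

dataset = StreamingDataset(
    remote="s3://my-bucket/my-dataset",  # placeholder remote path
    local="/tmp/my-dataset",             # placeholder local cache
    batch_size=8,
    replication=tensor_parallel_size,    # replicate samples across each TP group
)
dataloader = DataLoader(dataset, batch_size=8)
```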

snarayan21 avatar Jul 23 '24 07:07 snarayan21