
[REQUEST] What is the difference in pipeline parallelism between DeepSpeed and Megatron?

Open mollon650 opened this issue 2 years ago • 2 comments

What is the difference in pipeline parallelism between DeepSpeed and Megatron?

mollon650 · Dec 12 '23

They are mostly identical. The Megatron implementation is tightly coupled to Megatron-LM, so you cannot easily use it elsewhere. DeepSpeed's implementation is modular, so you can also pipeline-parallelize workloads outside of Megatron-DeepSpeed.
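For illustration, here is a minimal sketch of that modularity (the layer sizes and config file name are placeholders, and it assumes a distributed launch, e.g. via the `deepspeed` launcher): an arbitrary list of layers can be wrapped in DeepSpeed's `PipelineModule`, independent of Megatron.

```python
import torch.nn as nn
import deepspeed
from deepspeed.pipe import PipelineModule

# Requires a distributed launcher (deepspeed/torchrun) to set up env vars.
deepspeed.init_distributed()

# Any ordered list of layers can be pipelined, not just Megatron models.
layers = [nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 10)]
model = PipelineModule(layers=layers, num_stages=2,
                       loss_fn=nn.CrossEntropyLoss())

# Hypothetical usage: build the engine and train on micro-batches.
# engine, _, _, _ = deepspeed.initialize(model=model, config="ds_config.json")
# loss = engine.train_batch(data_iter=train_iter)
```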

One difference is that Megatron offers another optimization called 'interleaved/virtual pipelining' which can be enabled by using this argument - https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/arguments.py#L1097.
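As a hedged sketch (not a verified launch script): the argument behind that link is, at the time of writing, `--num-layers-per-virtual-pipeline-stage`. The snippet below only assembles and prints an example Megatron-LM command line; all model, data, and optimizer arguments are elided.

```python
# Interleaved/virtual pipelining gives each pipeline rank several small,
# non-contiguous chunks of layers instead of one contiguous block.
megatron_args = [
    "pretrain_gpt.py",                               # Megatron-LM entry point
    "--pipeline-model-parallel-size", "4",           # 4 pipeline stages
    "--num-layers-per-virtual-pipeline-stage", "2",  # enables interleaving
    # ... model, data, and optimizer arguments elided ...
]
print("torchrun --nproc-per-node 8 " + " ".join(megatron_args))
```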

siddharth9820 · Dec 15 '23

@siddharth9820 Thanks for your reply. I have another question about this code:

```python
if not fp16_master_weights_and_gradients:
    self.single_partition_of_fp32_groups.append(
        self.parallel_partitioned_bit16_groups[i][partition_id].to(
            self.device).clone().float().detach())
else:
    self.single_partition_of_fp32_groups.append(
        self.parallel_partitioned_bit16_groups[i][partition_id].to(
            self.device).clone().half().detach())

# keep this in case internal optimizer uses it
self.single_partition_of_fp32_groups[i].requires_grad = True
```

The entries of `single_partition_of_fp32_groups` are detached, and then `requires_grad` is set back to `True` on them. I am confused: why is the tensor detached first and then grad enabled again?

mollon650 · Dec 18 '23
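For background on the pattern asked about above, here is a minimal standalone PyTorch sketch (not DeepSpeed code, variable names are placeholders): `detach()` produces a fresh fp32 leaf tensor with no autograd link back to the bit16 partition, and setting `requires_grad = True` afterwards lets that leaf hold its own gradients for the optimizer.

```python
import torch

# Stand-in for one bit16 partition of model parameters.
bit16_partition = torch.randn(4, dtype=torch.float16, requires_grad=True)

# clone + cast + detach: a new fp32 leaf with no autograd history
# connecting it to bit16_partition (mirrors the pattern above).
fp32_master = bit16_partition.clone().float().detach()
print(fp32_master.is_leaf, fp32_master.requires_grad)   # True False

# Re-enable grad so the fp32 copy can accumulate its own .grad.
fp32_master.requires_grad = True

loss = (fp32_master * 2.0).sum()
loss.backward()
print(fp32_master.grad)       # gradients land on the fp32 master copy ...
print(bit16_partition.grad)   # ... not on the original bit16 tensor (None)
```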