[REQUEST] What's the difference in pipeline parallelism between DeepSpeed and Megatron?
They are mostly identical. The Megatron implementation is tightly coupled to Megatron-LM, so you cannot easily use it elsewhere. DeepSpeed's implementation is modular, so you can parallelize other workloads outside of Megatron-DeepSpeed as well.
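For example, here is a minimal sketch of wrapping an arbitrary stack of layers with DeepSpeed's standalone pipeline API, with no Megatron involved. The layer sizes, stage count, and loss function are made up for illustration, and it assumes a distributed launch (e.g. via the deepspeed launcher) so the process group can be set up:

```python
import deepspeed
import torch.nn as nn
from deepspeed.pipe import PipelineModule, LayerSpec

deepspeed.init_distributed()  # pipeline construction needs the process group

# Any sequence of layers can be pipelined, not just a Megatron transformer.
# LayerSpec delays construction so each rank only builds the layers it owns.
layers = [
    LayerSpec(nn.Linear, 1024, 4096),
    LayerSpec(nn.ReLU),
    LayerSpec(nn.Linear, 4096, 10),
]

model = PipelineModule(
    layers=layers,
    num_stages=2,                  # split the layer list across 2 pipeline stages
    loss_fn=nn.CrossEntropyLoss(),
)
# `model` then goes through deepspeed.initialize() like any other module.
```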
One difference is that Megatron offers an additional optimization called 'interleaved/virtual pipelining', which can be enabled via this argument: https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/arguments.py#L1097.
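Roughly, interleaved/virtual pipelining gives each pipeline rank several smaller, non-contiguous chunks of layers instead of one contiguous block, which shrinks the pipeline bubble. A toy sketch of the layer-to-rank assignment (the helper names and layer counts below are hypothetical, not Megatron's actual code):

```python
def contiguous_assignment(num_layers, pipeline_ranks):
    """Plain pipelining: each rank owns one contiguous block of layers."""
    per_rank = num_layers // pipeline_ranks
    return {rank: [list(range(rank * per_rank, (rank + 1) * per_rank))]
            for rank in range(pipeline_ranks)}

def interleaved_assignment(num_layers, pipeline_ranks, virtual_stages):
    """Interleaved schedule: each rank owns `virtual_stages` smaller chunks."""
    chunk = num_layers // (pipeline_ranks * virtual_stages)
    assignment = {rank: [] for rank in range(pipeline_ranks)}
    for v in range(virtual_stages):
        for rank in range(pipeline_ranks):
            start = (v * pipeline_ranks + rank) * chunk
            assignment[rank].append(list(range(start, start + chunk)))
    return assignment

# 16 layers over 4 pipeline ranks:
print(contiguous_assignment(16, 4))      # rank 0 -> [[0, 1, 2, 3]], ...
print(interleaved_assignment(16, 4, 2))  # rank 0 -> [[0, 1], [8, 9]], ...
```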
@siddharth9820 Thanks for your reply. I have another question, about this code:
```python
if not fp16_master_weights_and_gradients:
    self.single_partition_of_fp32_groups.append(
        self.parallel_partitioned_bit16_groups[i][partition_id].to(
            self.device).clone().float().detach())
else:
    self.single_partition_of_fp32_groups.append(
        self.parallel_partitioned_bit16_groups[i][partition_id].to(
            self.device).clone().half().detach())

self.single_partition_of_fp32_groups[
    i].requires_grad = True  # keep this in case internal optimizer uses it
```
The tensors in single_partition_of_fp32_groups are detached, and then requires_grad is set to True on them. I'm confused by this code: why is the tensor detached and then grad-enabled again?
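Here is a minimal standalone reproduction of the pattern I mean (the tensor and names are made up, not DeepSpeed code):

```python
import torch

bit16_param = torch.randn(4, dtype=torch.float16, requires_grad=True)

# Same pattern as above: copy to fp32, then cut the autograd link.
fp32_copy = bit16_param.clone().float().detach()
print(fp32_copy.requires_grad)   # False: detach() returns a non-tracking tensor
print(fp32_copy.grad_fn)         # None: no connection back to bit16_param

# Re-enabling grad does not reconnect it to bit16_param; it only lets this
# new leaf tensor hold its own .grad.
fp32_copy.requires_grad = True
fp32_copy.sum().backward()
print(fp32_copy.grad)            # populated on the fp32 copy
print(bit16_param.grad)          # still None: the detach stands
```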