[REQUEST] What's the difference in pipeline parallelism between DeepSpeed and Megatron?
They are mostly identical. The Megatron implementation is tightly coupled to Megatron-LM, so you cannot easily use it elsewhere. DeepSpeed's implementation is modular, so you can parallelize other workloads outside of Megatron-DeepSpeed as well.
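For example, here is a minimal sketch of wrapping an arbitrary stack of layers with DeepSpeed's standalone pipeline API, with no Megatron involved. The layer sizes, stage count, and loss function are made up for illustration, and it assumes a distributed launch (e.g. via the deepspeed launcher) so the process group can be set up:

```python
import deepspeed
import torch.nn as nn
from deepspeed.pipe import PipelineModule, LayerSpec

deepspeed.init_distributed()  # pipeline construction needs the process group

# Any sequence of layers can be pipelined, not just a Megatron transformer.
# LayerSpec delays construction so each rank only builds the layers it owns.
layers = [
    LayerSpec(nn.Linear, 1024, 4096),
    LayerSpec(nn.ReLU),
    LayerSpec(nn.Linear, 4096, 10),
]

model = PipelineModule(
    layers=layers,
    num_stages=2,                  # split the layer list across 2 pipeline stages
    loss_fn=nn.CrossEntropyLoss(),
)
# `model` then goes through deepspeed.initialize() like any other module.
```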
One difference is that Megatron offers an additional optimization called 'interleaved/virtual pipelining', which can be enabled via this argument: https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/arguments.py#L1097.
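Roughly, interleaved/virtual pipelining gives each pipeline rank several smaller, non-contiguous chunks of layers instead of one contiguous block, which shrinks the pipeline bubble. A toy sketch of the layer-to-rank assignment (the helper names and layer counts below are hypothetical, not Megatron's actual code):

```python
def contiguous_assignment(num_layers, pipeline_ranks):
    """Plain pipelining: each rank owns one contiguous block of layers."""
    per_rank = num_layers // pipeline_ranks
    return {rank: [list(range(rank * per_rank, (rank + 1) * per_rank))]
            for rank in range(pipeline_ranks)}

def interleaved_assignment(num_layers, pipeline_ranks, virtual_stages):
    """Interleaved schedule: each rank owns `virtual_stages` smaller chunks."""
    chunk = num_layers // (pipeline_ranks * virtual_stages)
    assignment = {rank: [] for rank in range(pipeline_ranks)}
    for v in range(virtual_stages):
        for rank in range(pipeline_ranks):
            start = (v * pipeline_ranks + rank) * chunk
            assignment[rank].append(list(range(start, start + chunk)))
    return assignment

# 16 layers over 4 pipeline ranks:
print(contiguous_assignment(16, 4))      # rank 0 -> [[0, 1, 2, 3]], ...
print(interleaved_assignment(16, 4, 2))  # rank 0 -> [[0, 1], [8, 9]], ...
```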
@siddharth9820 Thanks for your reply. I have another question, about this code:
```python
if not fp16_master_weights_and_gradients:
    self.single_partition_of_fp32_groups.append(
        self.parallel_partitioned_bit16_groups[i][partition_id].to(
            self.device).clone().float().detach())
else:
    self.single_partition_of_fp32_groups.append(
        self.parallel_partitioned_bit16_groups[i][partition_id].to(
            self.device).clone().half().detach())

self.single_partition_of_fp32_groups[
    i].requires_grad = True  # keep this in case internal optimizer uses it
```
The tensors in single_partition_of_fp32_groups are detached, and then requires_grad is set to True on them. I'm confused by this code: why is the tensor detached and then grad-enabled again?
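Here is a minimal standalone reproduction of the pattern I mean (the tensor and names are made up, not DeepSpeed code):

```python
import torch

bit16_param = torch.randn(4, dtype=torch.float16, requires_grad=True)

# Same pattern as above: copy to fp32, then cut the autograd link.
fp32_copy = bit16_param.clone().float().detach()
print(fp32_copy.requires_grad)   # False: detach() returns a non-tracking tensor
print(fp32_copy.grad_fn)         # None: no connection back to bit16_param

# Re-enabling grad does not reconnect it to bit16_param; it only lets this
# new leaf tensor hold its own .grad.
fp32_copy.requires_grad = True
fp32_copy.sum().backward()
print(fp32_copy.grad)            # populated on the fp32 copy
print(bit16_param.grad)          # still None: the detach stands
```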