[REQUEST] Replace reduce in ZeRO 1/2/3 with reduce_scatter
In ZeRO 1/2/3, with the computational graph unchanged (ignoring the change of shape), replacing all_reduce with reduce_scatter and carefully partitioning the optimizer states can reduce the time spent in the reduction operation.
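For context, here is a minimal sketch of the difference I mean, written against plain torch.distributed (illustrative names only, not DeepSpeed internals; it assumes the process group is already initialized and that the flat gradient length divides evenly by the world size):

```python
import torch
import torch.distributed as dist

# Minimal sketch (not DeepSpeed code): in ZeRO, each rank only updates the
# optimizer states for its own 1/world_size slice of the flat gradient, so it
# only needs the reduced values of that slice.
def get_owned_grad_shard(flat_grad: torch.Tensor, use_reduce_scatter: bool) -> torch.Tensor:
    world_size = dist.get_world_size()
    rank = dist.get_rank()
    shard_numel = flat_grad.numel() // world_size  # assume numel divides evenly

    if use_reduce_scatter:
        # reduce_scatter delivers only this rank's reduced slice.
        shard = torch.empty(shard_numel, dtype=flat_grad.dtype, device=flat_grad.device)
        dist.reduce_scatter_tensor(shard, flat_grad, op=dist.ReduceOp.SUM)
        return shard

    # all_reduce delivers the full reduced gradient to every rank, even though
    # only one slice of it is needed to update this rank's optimizer partition.
    dist.all_reduce(flat_grad, op=dist.ReduceOp.SUM)
    return flat_grad.narrow(0, rank * shard_numel, shard_numel)
```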
@efsotr, can you please provide more details for your request? Currently, reduce_scatter is the default option for ZeRO 1/2/3.
@tjruwase
https://github.com/microsoft/DeepSpeed/blob/9b7fc5452471392b0f58844219fcfdd14a9cdc77/deepspeed/runtime/zero/stage_1_and_2.py#L1054C2-L1146C1
If reduce_scatter is True, the program enters allreduce_no_retain (L1141), since the default value of use_multi_rank_bucket_allreduce is False. From there:

- allreduce_no_retain (L1549) calls allreduce_and_copy with rank not None
- allreduce_and_copy (L1526) calls allreduce_bucket with rank not None
- allreduce_bucket (L1488) calls dist.reduce when rank is not None

So my revised request is to replace dist.reduce with dist.reduce_scatter, combined with a carefully curated partitioning of the optimizer states, as sketched below.
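To make the request concrete, here is a rough sketch of what it amounts to (paraphrased from the call chain above; the helper names are made up and this is not the actual DeepSpeed source):

```python
import torch
import torch.distributed as dist

# Current behaviour as described above: each bucket ends in a dist.reduce
# targeted at the rank that owns that partition.
def reduce_bucket_to_owner(bucket: torch.Tensor, owner_rank: int) -> None:
    dist.reduce(bucket, dst=owner_rank, op=dist.ReduceOp.SUM)

# Requested behaviour: if the flat gradient is laid out so that rank i's
# optimizer partition occupies the i-th equal-sized contiguous slice, one
# reduce_scatter can replace the whole sequence of per-owner dist.reduce calls.
def reduce_scatter_to_owners(flat_grad: torch.Tensor) -> torch.Tensor:
    world_size = dist.get_world_size()
    shard = torch.empty(flat_grad.numel() // world_size,
                        dtype=flat_grad.dtype, device=flat_grad.device)
    dist.reduce_scatter_tensor(shard, flat_grad, op=dist.ReduceOp.SUM)
    return shard  # this rank's reduced partition
```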
@efsotr, thanks for providing this clarification. You are right that there is a problem because reduce_scatter is unnecessarily conditioned on use_multi_rank_bucket_allreduce.
We will put this on our TODO list to fix. In the meantime, can you validate that setting use_multi_rank_bucket_allreduce=True to enable reduce_scatter works correctly for your scenario?
@tjruwase With use_multi_rank_bucket_allreduce=True:

- allreduce_and_scatter is called
- allreduce_and_scatter (L1016) calls allreduce_and_copy_with_multiple_ranks
- allreduce_and_copy_with_multiple_ranks (L1004) calls allreduce_bucket with the default value of rank
- allreduce_bucket (L1488) calls dist.all_reduce when rank is None

So my revised request still stands: replace dist.reduce with dist.reduce_scatter, combined with a carefully curated partitioning of the optimizer states (see the note below on why the partitioning matters).
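By "curated partitioning" I mean, roughly, that the flat buffer has to be laid out and padded so that each rank's optimizer partition is an equal-sized contiguous chunk, since that is what reduce_scatter expects. For example (illustrative only):

```python
import torch

# Illustrative only: pad the flat buffer so it splits into equal-sized,
# contiguous per-rank chunks, which is what reduce_scatter expects.
def pad_flat_buffer(flat: torch.Tensor, world_size: int) -> torch.Tensor:
    remainder = flat.numel() % world_size
    if remainder == 0:
        return flat
    padding = torch.zeros(world_size - remainder, dtype=flat.dtype, device=flat.device)
    return torch.cat([flat, padding])
```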
@efsotr, I am not sure this is the right thing to do. Let me get help from @RezaYazdaniAminabadi, who has more knowledge of this code.
Hi @efsotr,
You are right that we only use all_reduce when use_multi_rank_bucket_allreduce is set to True (in my profiling, I did not see any benefit of using reduce-scatter instead of all-reduce in my benchmarks). However, by default, we should be using reduce_scatter, and the rank should be passed here in this function: https://github.com/microsoft/DeepSpeed/blob/9b7fc5452471392b0f58844219fcfdd14a9cdc77/deepspeed/runtime/zero/stage_1_and_2.py#L1538, when calling allreduce_no_retain here: https://github.com/microsoft/DeepSpeed/blob/9b7fc5452471392b0f58844219fcfdd14a9cdc77/deepspeed/runtime/zero/stage_1_and_2.py#L1141.
I guess what has confused you is the naming convention; otherwise, we should be doing reduce-scatter, and the else branch will be selected here: https://github.com/microsoft/DeepSpeed/blob/9b7fc5452471392b0f58844219fcfdd14a9cdc77/deepspeed/runtime/zero/stage_1_and_2.py#L1511. @tjruwase, can you confirm as well?
Best, Reza
@RezaYazdaniAminabadi, thanks for the clarification. So, if I understand correctly, allreduce vs reduce_scatter is determined by whether rank==None in the following logic?
https://github.com/microsoft/DeepSpeed/blob/9b7fc5452471392b0f58844219fcfdd14a9cdc77/deepspeed/runtime/zero/stage_1_and_2.py#L1506-L1511
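For my own understanding, that dispatch boils down to roughly the following (my paraphrase with simplified identifiers, not the verbatim source):

```python
import torch.distributed as dist

# Paraphrased sketch of the linked dispatch: the collective used depends on
# whether a destination rank was passed down the call chain.
def reduce_or_allreduce(tensor, rank=None, group=None):
    if rank is None:
        dist.all_reduce(tensor, group=group)        # no owner rank: full all_reduce
    else:
        dist.reduce(tensor, dst=rank, group=group)  # owner rank: reduce onto that rank
```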
@efsotr, can you please confirm whether your concerns are now addressed?
I do agree that code readability needs improvement.