[REQUEST] Replace reduce in ZeRO 1/2/3 with reduce_scatter
In ZeRO 1/2/3, with the computational graph unchanged (ignoring the change of shape), replacing all_reduce with reduce_scatter and carefully partitioning the optimizer states can reduce the time spent in the reduction operation.
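For context, here is a minimal sketch of the difference I mean, written against plain torch.distributed (illustrative names only, not DeepSpeed internals; it assumes the process group is already initialized and that the flat gradient length divides evenly by the world size):

```python
import torch
import torch.distributed as dist

# Minimal sketch (not DeepSpeed code): in ZeRO, each rank only updates the
# optimizer states for its own 1/world_size slice of the flat gradient, so it
# only needs the reduced values of that slice.
def get_owned_grad_shard(flat_grad: torch.Tensor, use_reduce_scatter: bool) -> torch.Tensor:
    world_size = dist.get_world_size()
    rank = dist.get_rank()
    shard_numel = flat_grad.numel() // world_size  # assume numel divides evenly

    if use_reduce_scatter:
        # reduce_scatter delivers only this rank's reduced slice.
        shard = torch.empty(shard_numel, dtype=flat_grad.dtype, device=flat_grad.device)
        dist.reduce_scatter_tensor(shard, flat_grad, op=dist.ReduceOp.SUM)
        return shard

    # all_reduce delivers the full reduced gradient to every rank, even though
    # only one slice of it is needed to update this rank's optimizer partition.
    dist.all_reduce(flat_grad, op=dist.ReduceOp.SUM)
    return flat_grad.narrow(0, rank * shard_numel, shard_numel)
```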
@efsotr, can you please provide more details for your request? Currently, reduce_scatter is the default option for ZeRO 1/2/3.
@tjruwase
https://github.com/microsoft/DeepSpeed/blob/9b7fc5452471392b0f58844219fcfdd14a9cdc77/deepspeed/runtime/zero/stage_1_and_2.py#L1054C2-L1146C1
If reduce_scatter is True, the program enters allreduce_no_retain (L1141), since the default value of use_multi_rank_bucket_allreduce is False. From there:

- allreduce_no_retain (L1549) calls allreduce_and_copy with rank not None
- allreduce_and_copy (L1526) calls allreduce_bucket with rank not None
- allreduce_bucket (L1488) calls dist.reduce when rank is not None

So my revised request is to replace dist.reduce with dist.reduce_scatter, combined with a carefully curated partitioning of the optimizer states, as sketched below.
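To make the request concrete, here is a rough sketch of what it amounts to (paraphrased from the call chain above; the helper names are made up and this is not the actual DeepSpeed source):

```python
import torch
import torch.distributed as dist

# Current behaviour as described above: each bucket ends in a dist.reduce
# targeted at the rank that owns that partition.
def reduce_bucket_to_owner(bucket: torch.Tensor, owner_rank: int) -> None:
    dist.reduce(bucket, dst=owner_rank, op=dist.ReduceOp.SUM)

# Requested behaviour: if the flat gradient is laid out so that rank i's
# optimizer partition occupies the i-th equal-sized contiguous slice, one
# reduce_scatter can replace the whole sequence of per-owner dist.reduce calls.
def reduce_scatter_to_owners(flat_grad: torch.Tensor) -> torch.Tensor:
    world_size = dist.get_world_size()
    shard = torch.empty(flat_grad.numel() // world_size,
                        dtype=flat_grad.dtype, device=flat_grad.device)
    dist.reduce_scatter_tensor(shard, flat_grad, op=dist.ReduceOp.SUM)
    return shard  # this rank's reduced partition
```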
@efsotr, thanks for providing this clarification. You are right that there is a problem because reduce_scatter is unnecessarily conditioned on use_multi_rank_bucket_allreduce.
We will put this on our TODO list to fix. In the meantime, can you validate that setting use_multi_rank_bucket_allreduce=True to enable reduce_scatter works correctly for your scenario?
@tjruwase With use_multi_rank_bucket_allreduce=True:

- allreduce_and_scatter is called
- allreduce_and_scatter (L1016) calls allreduce_and_copy_with_multiple_ranks
- allreduce_and_copy_with_multiple_ranks (L1004) calls allreduce_bucket with the default value of rank
- allreduce_bucket (L1488) calls dist.all_reduce when rank is None

So my revised request still stands: replace dist.reduce with dist.reduce_scatter, combined with a carefully curated partitioning of the optimizer states (see the note below on why the partitioning matters).
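By "curated partitioning" I mean, roughly, that the flat buffer has to be laid out and padded so that each rank's optimizer partition is an equal-sized contiguous chunk, since that is what reduce_scatter expects. For example (illustrative only):

```python
import torch

# Illustrative only: pad the flat buffer so it splits into equal-sized,
# contiguous per-rank chunks, which is what reduce_scatter expects.
def pad_flat_buffer(flat: torch.Tensor, world_size: int) -> torch.Tensor:
    remainder = flat.numel() % world_size
    if remainder == 0:
        return flat
    padding = torch.zeros(world_size - remainder, dtype=flat.dtype, device=flat.device)
    return torch.cat([flat, padding])
```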
@efsotr, I am not sure this is the right thing to do. Let me get help from @RezaYazdaniAminabadi, who has more knowledge of this code.
Hi @efsotr,
You are right that we only use all_reduce when use_multi_rank_bucket_allreduce is set to True (in my profiling, I did not see any benefit of using reduce-scatter instead of all-reduce in my benchmarks). However, by default, we should be using reduce_scatter, and the rank should be passed here in this function: https://github.com/microsoft/DeepSpeed/blob/9b7fc5452471392b0f58844219fcfdd14a9cdc77/deepspeed/runtime/zero/stage_1_and_2.py#L1538, when calling allreduce_no_retain here: https://github.com/microsoft/DeepSpeed/blob/9b7fc5452471392b0f58844219fcfdd14a9cdc77/deepspeed/runtime/zero/stage_1_and_2.py#L1141.
I guess what has confused you is the naming convention; otherwise, we should be doing reduce-scatter, and the else branch will be selected here: https://github.com/microsoft/DeepSpeed/blob/9b7fc5452471392b0f58844219fcfdd14a9cdc77/deepspeed/runtime/zero/stage_1_and_2.py#L1511. @tjruwase, can you confirm as well?
Best, Reza
@RezaYazdaniAminabadi, thanks for the clarification. So, if I understand correctly, allreduce vs reduce_scatter is determined by whether rank==None in the following logic?
https://github.com/microsoft/DeepSpeed/blob/9b7fc5452471392b0f58844219fcfdd14a9cdc77/deepspeed/runtime/zero/stage_1_and_2.py#L1506-L1511
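For my own understanding, that dispatch boils down to roughly the following (my paraphrase with simplified identifiers, not the verbatim source):

```python
import torch.distributed as dist

# Paraphrased sketch of the linked dispatch: the collective used depends on
# whether a destination rank was passed down the call chain.
def reduce_or_allreduce(tensor, rank=None, group=None):
    if rank is None:
        dist.all_reduce(tensor, group=group)        # no owner rank: full all_reduce
    else:
        dist.reduce(tensor, dst=rank, group=group)  # owner rank: reduce onto that rank
```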
@efsotr, can you please confirm whether your concerns are now addressed?
I do agree that code readability needs improvement.