[BUG] DeepSpeedZeroOptimizer_Stage3: gradients remaining in the IPG bucket are not reduced at the end of each step
Describe the bug When gradient_accumulation_steps is set to 1, __reduce_and_partition_ipg_grads should be invoked at the end of each step to reduce the gradients still remaining in the bucket. Currently it is not. As a result, at the end of each step only part of the gradients have been reduced, while the gradients left in the bucket are never reduced.
To Reproduce See independent_gradient_partition_epilogue, which is never invoked. There is also a related bug: gradient_accumulation_steps is not used anywhere in stage3.py. The issue is easy to reproduce with any model.
Expected behavior At the beginning of each step, self.elements_in_ipg_bucket should be 0; currently it is not.
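To make the failure mode concrete, here is a minimal, self-contained toy model of IPG bucketing (the class, method, and attribute names other than elements_in_ipg_bucket are illustrative, not DeepSpeed's actual implementation). It shows why, without an end-of-step epilogue call, a partially filled bucket survives the step boundary:

```python
class ToyIPGBucket:
    """Toy sketch of independent-parallel-gradient (IPG) bucketing."""

    def __init__(self, bucket_size):
        self.bucket_size = bucket_size
        self.elements_in_ipg_bucket = 0
        self.reduced = []   # gradients that have been reduced
        self.bucket = []    # gradients waiting in the bucket

    def add_grad(self, numel):
        # Flush first if this gradient would overflow the bucket.
        if self.elements_in_ipg_bucket + numel > self.bucket_size:
            self.reduce_ipg_grads()
        self.bucket.append(numel)
        self.elements_in_ipg_bucket += numel

    def reduce_ipg_grads(self):
        # Stand-in for __reduce_and_partition_ipg_grads.
        self.reduced.extend(self.bucket)
        self.bucket = []
        self.elements_in_ipg_bucket = 0

    def epilogue(self):
        # Stand-in for independent_gradient_partition_epilogue: must run
        # at the end of each step to flush the remainder of the bucket.
        self.reduce_ipg_grads()


# Without the epilogue, a partial bucket survives the step boundary.
buggy = ToyIPGBucket(bucket_size=100)
for numel in [60, 60, 30]:      # one backward pass
    buggy.add_grad(numel)
assert buggy.elements_in_ipg_bucket == 90   # 60 + 30 left unreduced

# With the epilogue, the invariant described above holds.
fixed = ToyIPGBucket(bucket_size=100)
for numel in [60, 60, 30]:
    fixed.add_grad(numel)
fixed.epilogue()
assert fixed.elements_in_ipg_bucket == 0    # bucket empty at step start
```

The assertions mirror the expected behavior stated above: after the epilogue runs, elements_in_ipg_bucket is back to 0 before the next step begins.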