Fix expert grad scaling problem with ZeRO optimizer
Fix [#6545]
Changes:
- Expert gradient averaging: divide by dp_world_size instead of edp_world_size, so expert gradients are scaled consistently with non-expert gradients (see the sketch after this list).
- Unit test: verify that a model trained with different dp/ep configurations produces the same expert gradients.
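Since edp_world_size = dp_world_size / ep_world_size, averaging expert gradients over only the expert data-parallel group leaves them ep_world_size times larger than dense gradients, which are averaged over the full dp_world_size. Below is a minimal, torch-free sketch of the arithmetic; the rank counts and per-rank gradient values are made up for illustration, and this is not DeepSpeed's actual reduction code:

```python
# Hypothetical sizes for illustration only.
dp_world_size = 8                                 # all data-parallel ranks
ep_world_size = 4                                 # expert-parallel group size
edp_world_size = dp_world_size // ep_world_size   # ranks replicating one expert -> 2

# Local gradients for one expert parameter, one value per rank in its
# expert data-parallel group (values are made up).
local_expert_grads = [1.0, 3.0]

# What an all-reduce(SUM) over the expert data-parallel group would yield.
reduced = sum(local_expert_grads)

old_grad = reduced / edp_world_size  # previous scaling -> 2.0
new_grad = reduced / dp_world_size   # fixed scaling    -> 0.5

# The old path over-scales the expert gradient by exactly ep_world_size.
assert old_grad == new_grad * ep_world_size
print(f"old={old_grad}, new={new_grad}")
```

The fixed scaling matches how dense parameters are treated: their all-reduce(SUM) spans all dp_world_size ranks and is divided by dp_world_size, so expert and non-expert gradients end up on the same scale.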
@wyooyw It seems that you should also delete or comment out https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/zero/stage_1_and_2.py#L1072 when you delete https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/zero/stage_1_and_2.py#L1079
Thank you for your suggestion. This redundant line of code has been removed.