Fix expert grad scaling problem with ZeRO optimizer
Fix [#6545]
Changes:
- Expert gradient averaging: divide by dp_world_size instead of edp_world_size, so expert gradients are scaled consistently with non-expert gradients (see the sketch after this list).
- Unit test: verify that a model trained with different dp/ep configurations produces the same expert gradients.
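Since edp_world_size = dp_world_size / ep_world_size, averaging expert gradients over only the expert data-parallel group leaves them ep_world_size times larger than dense gradients, which are averaged over the full dp_world_size. Below is a minimal, torch-free sketch of the arithmetic; the rank counts and per-rank gradient values are made up for illustration, and this is not DeepSpeed's actual reduction code:

```python
# Hypothetical sizes for illustration only.
dp_world_size = 8                                 # all data-parallel ranks
ep_world_size = 4                                 # expert-parallel group size
edp_world_size = dp_world_size // ep_world_size   # ranks replicating one expert -> 2

# Local gradients for one expert parameter, one value per rank in its
# expert data-parallel group (values are made up).
local_expert_grads = [1.0, 3.0]

# What an all-reduce(SUM) over the expert data-parallel group would yield.
reduced = sum(local_expert_grads)

old_grad = reduced / edp_world_size  # previous scaling -> 2.0
new_grad = reduced / dp_world_size   # fixed scaling    -> 0.5

# The old path over-scales the expert gradient by exactly ep_world_size.
assert old_grad == new_grad * ep_world_size
print(f"old={old_grad}, new={new_grad}")
```

The fixed scaling matches how dense parameters are treated: their all-reduce(SUM) spans all dp_world_size ranks and is divided by dp_world_size, so expert and non-expert gradients end up on the same scale.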
@wyooyw It seems that you should also delete or comment out https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/zero/stage_1_and_2.py#L1072 when you delete https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/zero/stage_1_and_2.py#L1079
Thank you for your suggestion. This redundant line of code has been removed.