Megha Agarwal
Results: 2 issues by Megha Agarwal
As the title suggests, this PR removes TP (tensor parallelism) for the MoE router. Duplicating the router across GPUs removes one all-reduce per MoE layer. This small change leads to **4-18%...
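To illustrate why replicating the router removes a collective, here is a minimal pure-Python sketch, not the PR's actual code. It assumes the tensor-parallel router was sharded along the hidden dimension, so each rank produces partial logits that must be summed across ranks. The function names, sizes, and the simulated all-reduce are all hypothetical and chosen only for illustration.

```python
import random


def matvec(weight, x):
    """Multiply a (num_experts x hidden) weight matrix by a hidden-sized vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def router_tensor_parallel(weight, x, world_size):
    """Simulated TP router (assumption: sharded along the hidden dimension).

    Returns the router logits and the number of simulated all-reduce calls.
    """
    shard = len(x) // world_size
    partials = []
    for rank in range(world_size):
        lo, hi = rank * shard, (rank + 1) * shard
        # Each rank only sees its own slice of the weights and activations.
        w_shard = [row[lo:hi] for row in weight]
        partials.append(matvec(w_shard, x[lo:hi]))
    # Simulated all-reduce: sum the partial logits from every rank.
    logits = [sum(vals) for vals in zip(*partials)]
    return logits, 1


def router_replicated(weight, x):
    """Replicated router: every rank holds the full weight, so no communication."""
    return matvec(weight, x), 0


random.seed(0)
HIDDEN, NUM_EXPERTS, WORLD_SIZE = 8, 4, 2  # hypothetical sizes
weight = [[random.uniform(-1, 1) for _ in range(HIDDEN)] for _ in range(NUM_EXPERTS)]
x = [random.uniform(-1, 1) for _ in range(HIDDEN)]

tp_logits, tp_allreduces = router_tensor_parallel(weight, x, WORLD_SIZE)
rep_logits, rep_allreduces = router_replicated(weight, x)
```

Both paths produce the same logits up to floating-point rounding. The replicated path trades a small amount of duplicated memory and compute for skipping the collective, which is reasonable here because the router weight is small compared with the expert weights.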
This PR implements RFC #6913. Follow-up PR to implement a cleaner solution that does not rely on a callback. The following changes were made: 1. Scheduler states'...