DeepSpeed
MoE - Token dropping for Full Tensor Parallelism
This PR enables token dropping for full tensor parallelism. It also corrects the timers.
(Still WIP)
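
For reviewers unfamiliar with the term, here is a minimal sketch of capacity-based token dropping in general. It is not the code in this PR and does not use DeepSpeed's API: the function name `drop_tokens`, its parameters, and the capacity formula `ceil(capacity_factor * num_tokens / num_experts)` are assumptions chosen for illustration, and it covers top-1 routing on a single rank only.

```python
import math


def drop_tokens(expert_ids, num_experts, capacity_factor=1.0):
    """Hypothetical helper: keep-mask for top-1 routed tokens under a per-expert capacity.

    expert_ids[i] is the expert chosen for token i. Tokens are admitted in
    order; once an expert is full, later tokens routed to it are dropped.
    """
    num_tokens = len(expert_ids)
    # Assumed capacity rule: an even share of the tokens, scaled by capacity_factor.
    capacity = math.ceil(capacity_factor * num_tokens / num_experts)

    counts = [0] * num_experts  # tokens admitted so far, per expert
    keep = []
    for expert in expert_ids:
        if counts[expert] < capacity:
            counts[expert] += 1
            keep.append(True)
        else:
            keep.append(False)  # expert is full: this token is dropped
    return keep, capacity


if __name__ == "__main__":
    keep_mask, capacity = drop_tokens([0, 0, 0, 1, 0, 1, 2, 0], num_experts=4)
    print(capacity, keep_mask)
```

With 8 tokens and 4 experts at `capacity_factor=1.0`, each expert accepts at most 2 tokens, so the third and later tokens routed to expert 0 are dropped.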