Wu Jiawei comments

Results: 6 comments by Wu Jiawei

Where do dispatch and combine need to be synchronized?

> > > How will the end-to-end latency of each rank change? Can it be estimated as max(dispatch latency) + expert group GEMM latency + max(combine latency)? >...
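
The estimate proposed in the question can be written down directly. This is a minimal sketch assuming, as the question does, that every rank waits for the slowest dispatch before its expert GEMM and for the slowest combine before the step ends. The function name and the sample latencies are hypothetical, not measured values.

```python
def estimate_step_latency(dispatch_us, gemm_us, combine_us):
    """Estimate one MoE step as max(dispatch) + expert GEMM + max(combine).

    dispatch_us / combine_us: per-rank latencies in microseconds (hypothetical inputs).
    gemm_us: expert group GEMM latency in microseconds, assumed equal on all ranks.
    """
    if not dispatch_us or not combine_us:
        raise ValueError("need at least one rank's dispatch and combine latency")
    # The slowest rank gates both communication phases in this model.
    return max(dispatch_us) + gemm_us + max(combine_us)


if __name__ == "__main__":
    # Three ranks with made-up latencies: 135.5 + 300.0 + 150.0
    print(estimate_step_latency([120.0, 135.5, 128.0], 300.0, [140.0, 150.0, 145.0]))
```

If GEMM time differs across ranks because of expert load imbalance, a single `gemm_us` no longer captures it, so the formula is best read as a first-order estimate.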

[QUESTION] Does Megatron support tracing computation graphs with torch.fx?

> > Hello, I encountered a similar problem. May I ask what your specific problem was and how you solved it? Thanks! > > Hello, I haven't found a good solution...

Allow using few SMs for low-latency mode

> As can be seen, when using 9 warpgroups (i.e., few SMs), performance slows down only slightly. This makes simple overlapping between this kernel and computation feasible....

Allow using few SMs for low-latency mode

> The DeepEP kernel uses fewer SMs, so what will use the extra SMs? For example, in the decode phase of sglang, the communication kernel and the compute kernel run serially....
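
The question of who uses the freed SMs can be illustrated with a toy timing model. This is a sketch under stated assumptions, not how DeepEP or sglang schedule kernels: it assumes some independent compute (for example, a second micro-batch) is available to run concurrently, and that compute time scales inversely with the number of SMs it gets. All function names and numbers are hypothetical; 132 SMs matches an H100 SXM.

```python
def serial_latency(comm_us, compute_us):
    # Serial schedule (as in the decode phase described above): communication
    # and compute run back to back, so SMs the comm kernel leaves free sit idle.
    return comm_us + compute_us


def overlapped_latency(comm_us, compute_us, total_sms, comm_sms):
    # Hypothetical overlap: the comm kernel holds `comm_sms` SMs while an
    # independent compute kernel runs on the remaining SMs at the same time.
    if not 0 < comm_sms < total_sms:
        raise ValueError("comm_sms must be strictly between 0 and total_sms")
    # Simplifying assumption: compute throughput is proportional to SM count.
    compute_shared_us = compute_us * total_sms / (total_sms - comm_sms)
    return max(comm_us, compute_shared_us)


if __name__ == "__main__":
    print(serial_latency(200.0, 400.0))                 # back-to-back
    print(overlapped_latency(200.0, 400.0, 132, 12))    # comm hidden under compute
```

With no independent work to overlap, the extra SMs bring no benefit, which is the point the comment raises about a purely serial decode step.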

When running the two-machine test_low_latency.py (EP16), there is a significant difference in the test results between the two machines

> I discussed this issue with [@liuhe-spec](https://github.com/liuhe-spec) on WeChat, and we strongly suspect it is related to RoCE network congestion control. > > If possible, you can ask your...

[Feature] Per Expert Overlap (PEO)

Hi, to achieve Parameters for Overlap, should we modify the forward pass in sglang and launch multiple GEMMs (one per expert group)?
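
The idea behind launching one GEMM per expert group can be reasoned about with a two-stage pipeline simulation. This is a hypothetical sketch, not the PEO design or an sglang API. It assumes that the GEMMs for successive groups run back to back, that group i's combine can start once its own GEMM and the previous combine have finished, and (optimistically) that splitting the GEMM into groups costs no efficiency.

```python
def pipelined_latency(gemm_us, combine_us):
    """Simulate per-expert-group GEMM -> combine as a two-stage pipeline.

    gemm_us[i], combine_us[i]: hypothetical times (microseconds) for group i.
    """
    if not gemm_us or len(gemm_us) != len(combine_us):
        raise ValueError("need one GEMM time and one combine time per group")
    gemm_end = 0.0     # when the GEMM of the current group finishes
    combine_end = 0.0  # when the combine of the previous group finishes
    for g, c in zip(gemm_us, combine_us):
        gemm_end += g  # GEMMs are issued back to back on the compute SMs
        # A group's combine needs its own GEMM output and a free comm stage.
        combine_end = max(gemm_end, combine_end) + c
    return combine_end


if __name__ == "__main__":
    print(pipelined_latency([400.0], [200.0]))            # one group: no overlap
    print(pipelined_latency([100.0] * 4, [50.0] * 4))     # four groups: overlap
```

In practice, smaller per-group GEMMs usually run less efficiently, so the gain from more groups has to be weighed against that loss.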