Xiaowei Ren

36 comments by Xiaowei Ren

Hi @zhaoyinglia, thanks for reaching out. Could you please point me to the code for the following? > In current code logic, the loss scaling with cp_size, but grad_data scaling...

Hi @i4never, thanks for submitting the PR! Have you run any E2E training tests with it? There are some cases where communication can take longer than compute, your...

Hi @SuperCB, do you mean Multi-head Latent Attention (MLA), which is used by DeepSeek? Technically, nothing should stop us from doing it; we just have not done it yet. Considering the popularity of...

Yeah, the A2A implementation can probably work with MLA out of the box. `AttnFuncWithCPAndKVAllGather` might also work for MLA. P2P cannot work because it concatenates K and V into a single...