
MuonClip for Kimi-K2 model

nvMelissa opened this issue 3 months ago • 0 comments

Is your feature request related to a problem? Please describe. This is not related to a problem; it is a feature request to expand model coverage.

Describe the solution you'd like TE will work to add the following three functionalities in order to support the Kimi-K2 model, which is popular in the Chinese market (ANT, Tencent, JD, RedNote, Moonshot, Xiaomi).

  • Expose max logit in fused attention: The "clip" part of the optimizer Kimi-K2 uses rescales the QKV projection weights based on the per-head maximum value out of BMM1. My understanding is that the row max is already collected for the backward pass (in flash-attention-style implementations) but not exposed. We could use this row max to implement MuonClip. <<< Will be developed by TE and cuDNN contributors

  • SYRK kernels: Multiplying a matrix by its own transpose yields a symmetric matrix, so half of the FLOPs can be saved. Currently written in Triton. <<< Will be developed by NVIDIA dev techs

  • "Fused" implementation: There is still a lot of development in this area, and the optimizer is not as stable as AdamW. Eventually, once it stabilizes, TE may be the right place to provide an optimized version, similar to FusedAdam. <<< Will be developed by NVIDIA dev techs
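To make the first bullet concrete, here is a minimal numpy sketch of the qk-clip idea: after a step, compute the per-head maximum attention logit (the max over BMM1, i.e. Q·Kᵀ scaled by 1/√d) and, if it exceeds a threshold, rescale the Q/K projection weights so the max logit is pulled back under that threshold. All names here (`qk_clip`, `tau`, the single-head shapes) are illustrative assumptions, not TE or Kimi-K2 API; the real implementation would consume the row max already tracked inside fused attention rather than recomputing the full logit matrix.

```python
import numpy as np

def qk_clip(W_q, W_k, X, tau=100.0):
    """Illustrative single-head qk-clip step (hypothetical names/API).

    W_q, W_k: (d_model, d_head) projection weights
    X:        (seq, d_model) input activations
    tau:      max-logit threshold (assumed hyperparameter)
    """
    d_head = W_q.shape[1]
    Q = X @ W_q
    K = X @ W_k
    logits = (Q @ K.T) / np.sqrt(d_head)   # BMM1 output
    s_max = logits.max()                   # per-head max logit
    if s_max > tau:
        gamma = tau / s_max
        # Split the rescale across Q and K: each weight gets sqrt(gamma),
        # so the logits (bilinear in W_q, W_k) shrink by exactly gamma.
        W_q = W_q * np.sqrt(gamma)
        W_k = W_k * np.sqrt(gamma)
    return W_q, W_k, s_max
```

Because the logits are bilinear in the two projections, scaling each by √γ scales every logit by γ, which is why exposing only the row max from the fused kernel is sufficient for the clip step.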

The first 2 tasks can happen in parallel and are prerequisites for the 3rd task.
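The SYRK bullet can be illustrated as follows: since C = A·Aᵀ is symmetric, only the upper (or lower) triangle needs to be computed explicitly, roughly halving the multiply count, and the other triangle is mirrored. This is a plain numpy sketch of the arithmetic being saved, not the Triton kernel mentioned above.

```python
import numpy as np

def syrk_upper(A):
    """Compute C = A @ A.T using only the upper triangle (symmetry trick)."""
    n = A.shape[0]
    C = np.zeros((n, n))
    for i in range(n):
        # Row i is multiplied only against rows i..n-1, skipping the
        # lower-triangle products that symmetry makes redundant.
        C[i, i:] = A[i] @ A[i:].T
    # Mirror the strict upper triangle into the lower triangle.
    return C + np.triu(C, 1).T
```

The loop performs n(n+1)/2 row dot-product groups instead of n², which is the ~2x FLOP saving a dedicated SYRK kernel exploits.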

Describe alternatives you've considered N/A

Add any other context or screenshots about the feature request here.

nvMelissa · Nov 02 '25 21:11