Lucas Wilkinson
@HarryWu99 thanks for putting this PR together, I'm interested in some of the metrics here too. It looks like it was approved with auto-merge enabled, which means it will merge...
@HarryWu99 thanks for updating the PR. A lot of the tests can be flaky, so I re-ran some of them to see if it's just flakiness, although I'm not familiar with the...
Still broken on:

```
nvidia-cutlass==3.5.1.0
```
PR opened: https://github.com/NVIDIA/cutlass/pull/2095
> I have one little question about scale layouts - why does scaleA have a stride like (1, M)? Will this layout improve the copying of scaleA in cutlass?

Yes, just...
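For intuition on the (1, M) stride, here is a minimal PyTorch sketch (shapes and group size are illustrative, not the PR's actual values): storing the A-scales column-major means the scales for consecutive rows sit next to each other in memory, which is presumably what helps the kernel copy them with contiguous accesses along M.

```python
import torch

# Illustrative shapes only: per-row, per-K-group scales for an (M, K) operand.
M, K, group_size = 256, 1024, 128
num_k_groups = K // group_size

# Allocate the scales as (num_k_groups, M) and transpose: the resulting view
# has shape (M, num_k_groups) with stride (1, M), i.e. column-major, so the
# scales of consecutive rows (same K-group) are contiguous in memory.
scale_a = torch.rand(num_k_groups, M, dtype=torch.float32).t()

assert scale_a.shape == (M, num_k_groups)
assert scale_a.stride() == (1, M)
```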
Landing this to help Blackwell perf, but I'd potentially like to follow up on https://github.com/vllm-project/vllm/pull/16032#discussion_r2061603794 in a future PR.
> Do you know any command to test a model with num_heads = 128? And probably no TP.

Not that I'm aware of :/ this is the smallest MLA model...
Ah I don't think it's an MLA model :/

```
"kv_lora_rank": null,
...
"use_mla": false,
```
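If it helps, here is a rough sketch (not vLLM's actual detection logic) of checking those config fields to tell whether a checkpoint is an MLA model:

```python
import json

def looks_like_mla(config_path: str) -> bool:
    """Heuristic based on the fields quoted above: MLA checkpoints (e.g.
    DeepSeek V2/V3) carry a non-null kv_lora_rank, while plain MHA/GQA
    configs leave it null or absent."""
    with open(config_path) as f:
        cfg = json.load(f)
    return cfg.get("kv_lora_rank") is not None

# The config quoted above has "kv_lora_rank": null (and "use_mla": false),
# so this returns False for that model.
```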
> > other than I do think we should turn it on by default for Blackwell, Any reason not to?
>
> My main concern is that the CUTLASS MLA...
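For context on "on by default for Blackwell", a hypothetical gating sketch might look like the following; the backend names and the compute-capability-10.x check are placeholders, not vLLM's actual selection logic.

```python
import torch

def pick_mla_backend() -> str:
    # Hypothetical selection: prefer the CUTLASS MLA kernel only on
    # Blackwell-class GPUs (compute capability 10.x here is an assumption),
    # and fall back to a generic MLA path everywhere else.
    major, _minor = torch.cuda.get_device_capability()
    return "CUTLASS_MLA" if major == 10 else "TRITON_MLA"
```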
> > Edit: oh and ideally I'd still like to see accuracy numbers...
>
> @LucasWilkinson this `DeepSeek-V2-Lite-Chat` only has attention head number == 16 and --tp=2 is not ok....
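One way to get the requested accuracy numbers would be lm-evaluation-harness's vLLM backend. This is only a sketch, assuming lm_eval (>= 0.4) and vLLM are installed; the task choice is illustrative, not what was asked for in the thread.

```python
import lm_eval

# Illustrative accuracy run for the model discussed above; gsm8k is just an
# example task, and tensor_parallel_size=1 sidesteps the --tp=2 concern
# mentioned above. Extra engine args (e.g. trust_remote_code=True if the
# checkpoint needs it) can be appended to model_args.
results = lm_eval.simple_evaluate(
    model="vllm",
    model_args="pretrained=deepseek-ai/DeepSeek-V2-Lite-Chat,tensor_parallel_size=1",
    tasks=["gsm8k"],
)
print(results["results"]["gsm8k"])
```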