Lucas Wilkinson
@HarryWu99 thanks for putting this PR together, I'm interested in some of the metrics here too. It looks like it was approved with auto-merge enabled, which means it will merge...
@HarryWu99 thanks for updating the PR. A lot of the tests can be flaky, so I re-ran some of them to see if it's just flakiness, although I'm not familiar with the...
Still broken on:

```
nvidia-cutlass==3.5.1.0
```
PR opened: https://github.com/NVIDIA/cutlass/pull/2095
> I have one little question about scale layouts - why does scaleA have a stride like (1, M)? Will this layout improve the copying of scaleA in cutlass?

Yes, just...
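For intuition on the (1, M) stride, here is a minimal PyTorch sketch (shapes and group size are illustrative, not the PR's actual values): storing the A-scales column-major means the scales for consecutive rows sit next to each other in memory, which is presumably what helps the kernel copy them with contiguous accesses along M.

```python
import torch

# Illustrative shapes only: per-row, per-K-group scales for an (M, K) operand.
M, K, group_size = 256, 1024, 128
num_k_groups = K // group_size

# Allocate the scales as (num_k_groups, M) and transpose: the resulting view
# has shape (M, num_k_groups) with stride (1, M), i.e. column-major, so the
# scales of consecutive rows (same K-group) are contiguous in memory.
scale_a = torch.rand(num_k_groups, M, dtype=torch.float32).t()

assert scale_a.shape == (M, num_k_groups)
assert scale_a.stride() == (1, M)
```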
Landing this to help Blackwell perf, but I'd potentially like to follow up on https://github.com/vllm-project/vllm/pull/16032#discussion_r2061603794 in a future PR.
> Do you know any command to test a model with num_heads = 128? And probably no TP.

Not that I'm aware of :/ this is the smallest MLA model...
Ah I don't think it's an MLA model :/

```
"kv_lora_rank": null,
...
"use_mla": false,
```
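If it helps, here is a rough sketch (not vLLM's actual detection logic) of checking those config fields to tell whether a checkpoint is an MLA model:

```python
import json

def looks_like_mla(config_path: str) -> bool:
    """Heuristic based on the fields quoted above: MLA checkpoints (e.g.
    DeepSeek V2/V3) carry a non-null kv_lora_rank, while plain MHA/GQA
    configs leave it null or absent."""
    with open(config_path) as f:
        cfg = json.load(f)
    return cfg.get("kv_lora_rank") is not None

# The config quoted above has "kv_lora_rank": null (and "use_mla": false),
# so this returns False for that model.
```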
> > other than I do think we should turn it on by default for Blackwell, Any reason not to?
>
> My main concern is that the CUTLASS MLA...
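For context on "on by default for Blackwell", a hypothetical gating sketch might look like the following; the backend names and the compute-capability-10.x check are placeholders, not vLLM's actual selection logic.

```python
import torch

def pick_mla_backend() -> str:
    # Hypothetical selection: prefer the CUTLASS MLA kernel only on
    # Blackwell-class GPUs (compute capability 10.x here is an assumption),
    # and fall back to a generic MLA path everywhere else.
    major, _minor = torch.cuda.get_device_capability()
    return "CUTLASS_MLA" if major == 10 else "TRITON_MLA"
```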
> > Edit: oh and ideally I'd still like to see accuracy numbers...
>
> @LucasWilkinson this `DeepSeek-V2-Lite-Chat` only has attention head number == 16 and --tp=2 is not ok....
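One way to get the requested accuracy numbers would be lm-evaluation-harness's vLLM backend. This is only a sketch, assuming lm_eval (>= 0.4) and vLLM are installed; the task choice is illustrative, not what was asked for in the thread.

```python
import lm_eval

# Illustrative accuracy run for the model discussed above; gsm8k is just an
# example task, and tensor_parallel_size=1 sidesteps the --tp=2 concern
# mentioned above. Extra engine args (e.g. trust_remote_code=True if the
# checkpoint needs it) can be appended to model_args.
results = lm_eval.simple_evaluate(
    model="vllm",
    model_args="pretrained=deepseek-ai/DeepSeek-V2-Lite-Chat,tensor_parallel_size=1",
    tasks=["gsm8k"],
)
print(results["results"]["gsm8k"])
```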