Yi Zhang comments

Results: 39 comments by Yi Zhang

[Feature] Apply Cublas Grouped Gemm kernel

Since the official docs list PyTorch 2.5.1 as supporting only CUDA 12.4, and we cannot easily change the PyTorch version, we need to update the docs to guide users to reinstall PyTorch if they...

[Feature] Apply Cublas Grouped Gemm kernel

LGTM cc @zhyncs

[QST] how to use groupwise scaling along M for FP8 gemm to implement per-token-per-128-channel and blockwise?

@xuzhenqi Not yet. If you make any progress, please let me know. Thank you very much!

[QST] how to use groupwise scaling along M for FP8 gemm to implement per-token-per-128-channel and blockwise?

I see https://github.com/NVIDIA/cutlass/pull/2095 has been merged, thanks a lot! @LucasWilkinson

[WIP] Support qwen2 vl model

> Thanks for the contributions. I left a few comments.
>
> We also did some refactoring recently (#1541, #1538). Could you rebase?

Sorry for the late reply, I am...

[WIP] Support qwen2 vl model

> Thanks for the contributions. I left a few comments.
>
> We also did some refactoring recently (#1541, #1538). Could you rebase?

OK, I have rebased the code onto the latest...

[WIP] Support qwen2 vl model

> Can this run correctly now without the modification/update of vllm? If so, we can remove "WIP" in the PR title and merge this soon!

I think not, there are...

> It seems this PR is merged to [yizhang2077:support-qwen2-vl](https://github.com/yizhang2077/sglang/tree/support-qwen2-vl) by accident? Should we open a new one?

It seems this PR was merged into the qwen2vl branch, and when this PR #1711...

[Bug] Vision attention mask cache is never released and cause OOM

Hi @MagiaSN, #3657 seems to have addressed your issue. Could you try again? I am closing this issue; if you still have the problem, you can reopen it...

Support qwen3 deepep

Do we need to raise an error for bf16 when deepep is enabled?