Luka Govedič

97 comments by Luka Govedič

> Besides, between #15734 and #12591, the Triton FA code path in ROCmBackend is broken (as spotted in https://github.com/vllm-project/vllm/pull/17235)

Yes, these were supposed to be merged in the opposite order,...

Getting this error on Python 3.10 after this PR:

```
Traceback (most recent call last):
  File "/home/luka/neuralmagic-vllm/examples/offline_inference/basic/generate.py", line 3, in <module>
    from vllm import LLM, EngineArgs
  File "/home/luka/neuralmagic-vllm/vllm/__init__.py", line 12, in...
```

> * For some reason I had to put [this line](https://github.com/vllm-project/vllm/blob/be22bb6f3dd7aaf8559a4a0a1beb98a37a5a8138/vllm/model_executor/layers/fused_moe/fused_marlin_moe.py#L204) within a `with torch.no_grad():`, otherwise I got an error like `RuntimeError: sum(): functions with out=... arguments don't support automatic...
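The `RuntimeError` quoted above is standard PyTorch behavior: `out=` variants of reductions don't support autograd, so they raise when any input requires grad. A minimal sketch of the workaround (the function and buffer names here are illustrative, not vLLM's actual `fused_marlin_moe` code):

```python
import torch


def sum_into_buffer(x: torch.Tensor, out: torch.Tensor) -> torch.Tensor:
    # If `x` requires grad, `torch.sum(..., out=...)` raises
    # "RuntimeError: sum(): functions with out=... arguments don't
    # support automatic differentiation". Disabling grad tracking
    # around the write sidesteps the error.
    with torch.no_grad():
        return torch.sum(x, dim=0, out=out)


x = torch.randn(4, 3, requires_grad=True)
buf = torch.empty(3)
sum_into_buffer(x, buf)  # fills `buf` with the column sums of `x`
```

Without the `with torch.no_grad():` guard, the same call on a grad-requiring tensor fails at runtime, which matches the error in the quote.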

@youkaichao I updated the import statement and error message. But we could also just get rid of the optional import, as vllm-flash-attn should always get included/built - let me know...
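For context, the "optional import" pattern being discussed looks roughly like the sketch below: try to import the compiled attention module, and defer a clear error to first use if it is missing. This is an illustrative sketch, not vLLM's exact code; the module and symbol names are assumptions.

```python
# Hedged sketch of an optional-import guard for a compiled extension.
# `vllm_flash_attn` / `flash_attn_varlen_func` are assumed names here.
try:
    from vllm_flash_attn import flash_attn_varlen_func
    HAS_VLLM_FLASH_ATTN = True
except ImportError:
    flash_attn_varlen_func = None
    HAS_VLLM_FLASH_ATTN = False


def get_flash_attn():
    """Return the flash-attn kernel, raising a descriptive error if absent."""
    if flash_attn_varlen_func is None:
        raise ImportError(
            "vllm-flash-attn is not available; it should always be "
            "included/built with vLLM, so this indicates a broken install."
        )
    return flash_attn_varlen_func
```

The alternative raised in the comment is to drop the `try/except` entirely and import unconditionally, since the package should always ship with the build; the guard only buys a friendlier error message.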

@sarckk sorry to leave this hanging for so long - but could you fix the merge conflicts so we can merge the PR?

More conflicts; could you rebase again? Sorry for the delayed review

I'm currently overhauling custom op matching in #24604. We also recently added a torch implementation of group quant, could you compare its performance with AITER? Also could you compare the...

(I asked @russellb to disable auto-merge until we get to the bottom of the performance numbers here)