Jinzhen Lin
@alexm-nm I have restructured the code. Can you review it again?
@alexm-nm @bnellnm All previous comments have been addressed. As for the test in `test_gptq_marlin.py`: since the naive gptq kernel doesn't support bf16 yet (https://github.com/vllm-project/vllm/pull/4781), I compare the outputs of gptq-marlin-bf16...
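A minimal sketch of that comparison approach, assuming hypothetical helper names (`run_gptq_marlin`, `run_reference` are placeholders, not the actual functions in `test_gptq_marlin.py`):

```python
import torch

# Hypothetical sketch: since the naive gptq kernel has no bf16 path, one way
# to test is to compare the gptq-marlin bf16 output against an fp16 reference
# within a loose tolerance. The two runner functions below are placeholders.
def check_marlin_bf16(x: torch.Tensor, weights) -> None:
    out_bf16 = run_gptq_marlin(x.to(torch.bfloat16), weights)  # kernel under test
    out_ref = run_reference(x.to(torch.float16), weights)      # fp16 reference
    torch.testing.assert_close(
        out_bf16.float(), out_ref.float(), rtol=1e-2, atol=1e-2
    )
```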
@tjruwase Exactly, I mean compiling CUDA ops on a machine without a GPU. But the CI doesn't build the ops. In the mentioned issue, we encountered an error because the `quantizer` op...
@tjruwase Sorry for not checking CPU-only builds before submitting the PR. I notice that the cpu-only target environment was introduced recently (after v0.8.0) and DeepSpeed mainly targets GPUs for now. So...
@microsoft-github-policy-service agree
I met the same issue when compiling version 0.8.3, and after debugging I found the reason: I built DeepSpeed on a machine without a GPU, so the nvcc compilation arguments...
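For context, a minimal sketch of the failure mode, assuming a simplified build helper (not DeepSpeed's actual build code): without a GPU, the device-capability query fails, so the nvcc arch flags must come from an explicit override such as `TORCH_CUDA_ARCH_LIST`:

```python
import os
import torch

def compute_capabilities() -> list[str]:
    """Simplified sketch of how a build script might pick nvcc arch flags."""
    # An explicit override works on GPU-less build machines.
    arch_list = os.environ.get("TORCH_CUDA_ARCH_LIST")
    if arch_list:
        return [a.strip() for a in arch_list.replace(";", " ").split()]
    # Without a GPU, the device query below is impossible.
    if not torch.cuda.is_available():
        raise RuntimeError(
            "No GPU detected; set TORCH_CUDA_ARCH_LIST (e.g. '8.0') to build CUDA ops."
        )
    major, minor = torch.cuda.get_device_capability(0)
    return [f"{major}.{minor}"]
```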
Benchmark results of deepseek-v3-awq on 8*A800 (tokens/s):

| bs | https://github.com/vllm-project/vllm/pull/13321 | https://github.com/vllm-project/vllm/pull/13321 + this PR |
| -- | -- | -- |
| 1 | 46.2 | 50.2 |
| 2 | 79.1 | 84.1 |
| ... | ... | ... |
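The relative gain works out as follows (a quick check using the two rows shown above):

```python
# Quick speedup check from the benchmark rows above (tokens/s).
baseline = {1: 46.2, 2: 79.1}  # PR #13321 only
combined = {1: 50.2, 2: 84.1}  # PR #13321 + this PR

for bs in baseline:
    speedup = combined[bs] / baseline[bs]
    print(f"bs={bs}: {speedup:.3f}x ({(speedup - 1) * 100:.1f}% faster)")
# bs=1: 1.087x (8.7% faster)
# bs=2: 1.063x (6.3% faster)
```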
@mgoin All tests have passed. Can you merge it?
dense marlin benchmark tests (on A800)   
moe marlin benchmark tests (on A800) (**NOTE1**: The optimization methods introduced in this PR have already been implemented in https://github.com/vllm-project/vllm/pull/14447 for cases where k...