Jinzhen Lin
@alexm-nm I have restructured the code. Can you review it again?
@alexm-nm @bnellnm All previous comments have been addressed. As for the test in `test_gptq_marlin.py`: since the naive gptq kernel doesn't support bf16 yet (https://github.com/vllm-project/vllm/pull/4781), I compare the outputs of gptq-marlin-bf16...
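A minimal sketch of that comparison approach, assuming hypothetical helper names (`run_gptq_marlin`, `run_reference` are placeholders, not the actual functions in `test_gptq_marlin.py`):

```python
import torch

# Hypothetical sketch: since the naive gptq kernel has no bf16 path, one way
# to test is to compare the gptq-marlin bf16 output against an fp16 reference
# within a loose tolerance. The two runner functions below are placeholders.
def check_marlin_bf16(x: torch.Tensor, weights) -> None:
    out_bf16 = run_gptq_marlin(x.to(torch.bfloat16), weights)  # kernel under test
    out_ref = run_reference(x.to(torch.float16), weights)      # fp16 reference
    torch.testing.assert_close(
        out_bf16.float(), out_ref.float(), rtol=1e-2, atol=1e-2
    )
```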
@tjruwase Exactly, I mean compiling CUDA ops on a machine without a GPU. But the CI doesn't build the ops. In the mentioned issue, we encountered an error because the `quantizer` op...
@tjruwase Sorry for not checking CPU-only builds before submitting the PR. I notice that the cpu-only target environment was introduced recently (after v0.8.0) and DeepSpeed mainly targets GPUs for now. So...
@microsoft-github-policy-service agree
I met the same issue when compiling version 0.8.3, and after debugging I found the reason: I built DeepSpeed on a machine without a GPU, so the nvcc compilation arguments...
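For context, a minimal sketch of the failure mode, assuming a simplified build helper (not DeepSpeed's actual build code): without a GPU, the device-capability query fails, so the nvcc arch flags must come from an explicit override such as `TORCH_CUDA_ARCH_LIST`:

```python
import os
import torch

def compute_capabilities() -> list[str]:
    """Simplified sketch of how a build script might pick nvcc arch flags."""
    # An explicit override works on GPU-less build machines.
    arch_list = os.environ.get("TORCH_CUDA_ARCH_LIST")
    if arch_list:
        return [a.strip() for a in arch_list.replace(";", " ").split()]
    # Without a GPU, the device query below is impossible.
    if not torch.cuda.is_available():
        raise RuntimeError(
            "No GPU detected; set TORCH_CUDA_ARCH_LIST (e.g. '8.0') to build CUDA ops."
        )
    major, minor = torch.cuda.get_device_capability(0)
    return [f"{major}.{minor}"]
```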
Benchmark results of deepseek-v3-awq on 8*A800 (tokens/s):

| bs | https://github.com/vllm-project/vllm/pull/13321 | https://github.com/vllm-project/vllm/pull/13321 + this PR |
| -- | -- | -- |
| 1 | 46.2 | 50.2 |
| 2 | 79.1 | 84.1 |
| ... | ... | ... |
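The relative gain works out as follows (a quick check using the two rows shown above):

```python
# Quick speedup check from the benchmark rows above (tokens/s).
baseline = {1: 46.2, 2: 79.1}  # PR #13321 only
combined = {1: 50.2, 2: 84.1}  # PR #13321 + this PR

for bs in baseline:
    speedup = combined[bs] / baseline[bs]
    print(f"bs={bs}: {speedup:.3f}x ({(speedup - 1) * 100:.1f}% faster)")
# bs=1: 1.087x (8.7% faster)
# bs=2: 1.063x (6.3% faster)
```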
@mgoin All tests have passed. Can you merge it?
dense marlin benchmark tests (on A800)   
moe marlin benchmark tests (on A800) (**NOTE1**: The optimization methods introduced in this PR have already been implemented in https://github.com/vllm-project/vllm/pull/14447 for cases where k...