Results: 10 issues by HandH1998

Hi, I think there is a problem with this position embedding implementation. Could you please take a look?

I built lightseq on CUDA 11.4 successfully, then ran a llama-13B inference test on an A100-80G with max_step=1024. When max_batch_size = 11, _lightseq/csrc/ops_new/sampling.cc.cu(73): an illegal memory access was encountered._ And I...

Thanks for your wonderful work! I am trying to understand matrix A's layout in shared memory. I think A's shape is `(16 * thread_m_blocks) * (16 * thread_k_blocks)` in shared...

### Anything you want to discuss about vllm. For gptq_marlin, `min_thread_n=64 min_thread_k=64` is required in https://github.com/vllm-project/vllm/blob/70c232f85a9e83421a4d9ca95e6384364271f2bc/csrc/quantization/gptq_marlin/gptq_marlin.cuh#L22-L23, while `min_thread_n=64 min_thread_k=128` is required in https://github.com/vllm-project/vllm/blob/70c232f85a9e83421a4d9ca95e6384364271f2bc/vllm/model_executor/layers/quantization/utils/marlin_utils.py#L21-L22. Why are the two limits different?


We have proposed a W4A8 quantization solution, QQQ, and integrated it into vllm. QQQ not only achieves performance similar to the leading W4A8, W8A8, and W4A16 quantization methods...

## Motivation We have implemented W4A8 quantization for the lmdeploy turbomind backend using our quantization algorithm [QQQ](https://github.com/HandH1998/QQQ) to enhance inference throughput. We hope that lmdeploy users will find this beneficial....

Hello, you did excellent work! I implemented W4A8 Marlin based on your code and achieved a considerable speedup. The main idea is described in Section 3.3 of our paper...

## Bug There are three supported attention modes for Qwen2 in transformers: `eager`, `sdpa`, and `flash_attention_2`. I have evaluated wikitext2 PPL on the Qwen2-7B model under the different attention modes. The...
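For reference, a minimal sketch of how the three attention backends can be selected in transformers when reproducing such a comparison; the checkpoint name, dtype, and device mapping below are assumptions, not taken from the issue:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2-7B"  # assumed checkpoint; the issue evaluates a Qwen2-7B model
tokenizer = AutoTokenizer.from_pretrained(model_id)

for attn_impl in ("eager", "sdpa", "flash_attention_2"):
    # attn_implementation selects the attention backend in transformers
    # ("flash_attention_2" additionally requires the flash-attn package).
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,
        attn_implementation=attn_impl,
        device_map="auto",
    )
    # ... run the same wikitext2 PPL evaluation for each backend here ...
```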

Following https://github.com/sgl-project/sglang/pull/3047, we replace the w8a8 fp8 vllm kernel with the sgl-kernel one. Generally, the w8a8 fp8 sgl-kernel yields higher accuracy on gsm8k. On an sm89 L40, the w8a8 fp8 sgl-kernel delivers a 14% higher...


Speeding up cumsum in top_p sampling with upper-triangular matrix multiplication. The idea originally comes from the 13th item of https://zhuanlan.zhihu.com/p/673671771. As we know, torch.cumsum uses CUDA cores, while matmul...
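A minimal sketch of the trick being described, assuming the cumulative sum is taken over the last (vocabulary) dimension of a probability tensor; the function and variable names are illustrative, not the PR's actual implementation:

```python
import torch

def cumsum_via_matmul(probs: torch.Tensor) -> torch.Tensor:
    """Row-wise cumulative sum of `probs` (batch, vocab) via a matmul.

    cumsum(x)[j] = sum_{i <= j} x[i] = (x @ U)[j], where U is an
    upper-triangular matrix of ones (U[i, j] = 1 for i <= j), so the
    prefix sum can run as a GEMM instead of torch.cumsum.
    """
    n = probs.shape[-1]
    tri = torch.triu(torch.ones(n, n, device=probs.device, dtype=probs.dtype))
    return probs @ tri

if __name__ == "__main__":
    x = torch.rand(4, 8)
    assert torch.allclose(cumsum_via_matmul(x), torch.cumsum(x, dim=-1), atol=1e-5)
```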
