Results: 10 issues by HandH1998

Hi, I think there is a problem with this position embedding implementation. Could you please take a look?

I built lightseq on CUDA 11.4 successfully, then ran a llama-13B inference test on an A100-80G with max_step=1024. When max_batch_size = 11, _lightseq/csrc/ops_new/sampling.cc.cu(73): an illegal memory access was encountered._ And I...

Thanks for your wonderful work! I am trying to understand matrix A's layout in shared memory. I think A's shape is `(16 * thread_m_blocks) * (16 * thread_k_blocks)` in shared...

### Anything you want to discuss about vllm. For gptq_marlin, `min_thread_n=64 min_thread_k=64` is required in https://github.com/vllm-project/vllm/blob/70c232f85a9e83421a4d9ca95e6384364271f2bc/csrc/quantization/gptq_marlin/gptq_marlin.cuh#L22-L23, while `min_thread_n=64 min_thread_k=128` is required in https://github.com/vllm-project/vllm/blob/70c232f85a9e83421a4d9ca95e6384364271f2bc/vllm/model_executor/layers/quantization/utils/marlin_utils.py#L21-L22. Why are the two limits different?


We have proposed a W4A8 quantization solution, QQQ, and integrated it into vllm. QQQ not only achieves performance similar to the leading W4A8, W8A8, and W4A16 quantization methods...

## Motivation We have implemented W4A8 quantization for the lmdeploy turbomind backend using our quantization algorithm [QQQ](https://github.com/HandH1998/QQQ) to enhance inference throughput. We hope that lmdeploy users will find this beneficial....

Hello, you did excellent work! I implemented W4A8 Marlin based on your code and achieved a considerable speedup. The main idea is described in Section 3.3 of our paper...

## Bug There are three supported attention modes for Qwen2 in transformers: `eager`, `sdpa`, and `flash_attention_2`. I have evaluated wikitext2 PPL on the Qwen2-7B model under the different attention modes. The...
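For reference, a minimal sketch of how the three attention backends can be selected in transformers when reproducing such a comparison; the checkpoint name, dtype, and device mapping below are assumptions, not taken from the issue:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2-7B"  # assumed checkpoint; the issue evaluates a Qwen2-7B model
tokenizer = AutoTokenizer.from_pretrained(model_id)

for attn_impl in ("eager", "sdpa", "flash_attention_2"):
    # attn_implementation selects the attention backend in transformers
    # ("flash_attention_2" additionally requires the flash-attn package).
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,
        attn_implementation=attn_impl,
        device_map="auto",
    )
    # ... run the same wikitext2 PPL evaluation for each backend here ...
```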

Following https://github.com/sgl-project/sglang/pull/3047, we replace the w8a8 fp8 vllm kernel with the sgl-kernel one. Generally, the w8a8 fp8 sgl-kernel yields higher accuracy on gsm8k. On an sm89 L40, the w8a8 fp8 sgl-kernel delivers a 14% higher...


Speeding up cumsum in top_p sampling with upper-triangular matrix multiplication. The idea originally comes from the 13th item of https://zhuanlan.zhihu.com/p/673671771. As we know, torch.cumsum uses CUDA cores, while matmul...
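A minimal sketch of the trick being described, assuming the cumulative sum is taken over the last (vocabulary) dimension of a probability tensor; the function and variable names are illustrative, not the PR's actual implementation:

```python
import torch

def cumsum_via_matmul(probs: torch.Tensor) -> torch.Tensor:
    """Row-wise cumulative sum of `probs` (batch, vocab) via a matmul.

    cumsum(x)[j] = sum_{i <= j} x[i] = (x @ U)[j], where U is an
    upper-triangular matrix of ones (U[i, j] = 1 for i <= j), so the
    prefix sum can run as a GEMM instead of torch.cumsum.
    """
    n = probs.shape[-1]
    tri = torch.triu(torch.ones(n, n, device=probs.device, dtype=probs.dtype))
    return probs @ tri

if __name__ == "__main__":
    x = torch.rand(4, 8)
    assert torch.allclose(cumsum_via_matmul(x), torch.cumsum(x, dim=-1), atol=1e-5)
```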
