Jinzhen Lin
**Describe the bug** In notebook 6.4.12, we sometimes cannot rename files. **To Reproduce** Steps to reproduce the behavior: 1. Install notebook 6.4.12 2. Start notebook with `root_dir="/home/foobar"` 3. Create a directory...
`torch.cuda.is_available()` is not necessary here, and it causes https://github.com/microsoft/DeepSpeed/issues/2858 when compiling DeepSpeed >= 0.8.1 on a machine without a GPU (e.g. during a docker image build).
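A minimal sketch of the idea, not DeepSpeed's actual op_builder code: decide at build time whether CUDA ops can be compiled without calling `torch.cuda.is_available()`, which touches the driver and fails on GPU-less build machines. The `FORCE_CUDA_BUILD` flag below is hypothetical, added only to illustrate an override path for docker builds.

```python
import os
import torch


def cuda_toolkit_present() -> bool:
    # torch.version.cuda only reflects the CUDA toolkit PyTorch was built
    # with; it never initializes the driver, so it is safe on machines
    # without a GPU (e.g. inside a docker image build).
    return torch.version.cuda is not None


def should_build_cuda_ops() -> bool:
    # Hypothetical override flag so cross-compilation can force CUDA ops
    # on even when no GPU is visible at build time.
    if os.environ.get("FORCE_CUDA_BUILD", "0") == "1":
        return True
    return cuda_toolkit_present()


if __name__ == "__main__":
    print("build CUDA ops:", should_build_cuda_ops())
```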
Currently, if we load models with different dtypes in the same process, we get an error like ``` File ~/.miniconda3/lib/python3.8/site-packages/vllm/_custom_ops.py:89, in rotary_embedding(positions, query, key, head_size, cos_sin_cache, is_neox) 81 def...
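A minimal reproduction sketch, assuming two small models can be loaded in one process; the model name is a placeholder. Creating a second `LLM` with a different dtype in the same process is what surfaces the dtype mismatch inside the custom `rotary_embedding` op from the traceback above.

```python
from vllm import LLM

# First engine with fp16 weights.
llm_fp16 = LLM(model="facebook/opt-125m", dtype="float16")
print(llm_fp16.generate("Hello, my name is"))

# Second engine in the same process, but with bfloat16 weights.
llm_bf16 = LLM(model="facebook/opt-125m", dtype="bfloat16")
print(llm_bf16.generate("Hello, my name is"))  # the reported error surfaces here
```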
Some models overflow when using fp16 inference (e.g. Deepseek-V2), so we should add bfloat16 support to the quantization kernels. This PR adds bfloat16 support for the **gptq marlin kernel**. Unlike gptq...
Some models overflow when using fp16 inference (e.g. Deepseek-V2), so we should add bfloat16 support to the quantization kernels. This PR adds bfloat16 support for the gptq kernel. Related issue: https://github.com/vllm-project/vllm/issues/2149...
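A usage sketch for the two bfloat16 PRs above (the checkpoint name is a placeholder): with bfloat16 support in the gptq and gptq marlin kernels, a quantized checkpoint can be served with `dtype="bfloat16"` to avoid the fp16 overflow mentioned above.

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Llama-2-7B-GPTQ",  # placeholder GPTQ checkpoint
    quantization="gptq",               # or "gptq_marlin" for the marlin kernel
    dtype="bfloat16",                  # previously only fp16 was supported
)
params = SamplingParams(temperature=0.0, max_tokens=32)
print(llm.generate(["The capital of France is"], params)[0].outputs[0].text)
```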
### Your current environment ```text The output of `python collect_env.py` ``` ### 🐛 Describe the bug vLLM loads LoRA checkpoints when executing the model https://github.com/vllm-project/vllm/blob/v0.4.2/vllm/worker/model_runner.py#L789-L790 https://github.com/vllm-project/vllm/blob/v0.4.2/vllm/lora/worker_manager.py#L138-L172 Then when we get an...
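For context on the code path linked above, a sketch of how a LoRA adapter is attached per request in vLLM 0.4.x (model name and adapter path are placeholders): the worker manager loads the LoRA checkpoint lazily when the model is executed, which is where the report above points.

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_lora=True)

# LoRARequest(name, int_id, local_path); the path is a placeholder.
lora = LoRARequest("my-adapter", 1, "/path/to/lora_checkpoint")

out = llm.generate(
    ["Write a haiku about GPUs."],
    SamplingParams(max_tokens=32),
    lora_request=lora,  # adapter weights are loaded during model execution
)
print(out[0].outputs[0].text)
```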
https://github.com/vllm-project/vllm/pull/12185 added the triton moe wna16 kernel, but triton cannot reach the best performance when m is small. This PR adds the moe wna16 cuda kernel. It has better generation speed...
The performance of the gptq/awq marlin kernel is low when n is small. The reason is that when n is small, `barrier_acquire` and `barrier_release` cost too much time. The case...
https://github.com/vllm-project/vllm/pull/12185 and https://github.com/vllm-project/vllm/pull/13321 introduced the triton/cuda moe wna16 kernels to optimize the performance of moe gptq/awq. However, the best-performing gptq/awq kernel currently is the marlin kernel, and I hope to...
This PR optimizes the dense marlin kernel and the moe marlin kernel. Summary: - **(dense marlin only)** Migrate the optimization method introduced for the moe marlin kernel in https://github.com/vllm-project/vllm/pull/14447 to the dense...