Jinzhen Lin
**Describe the bug** In notebook 6.4.12, we sometimes cannot rename files. **To Reproduce** Steps to reproduce the behavior: 1. Install notebook 6.4.12 2. Start notebook with `root_dir="/home/foobar"` 3. Create a directory...
`torch.cuda.is_available()` is not necessary here, and it causes https://github.com/microsoft/DeepSpeed/issues/2858 when compiling DeepSpeed >= 0.8.1 on a machine without a GPU (e.g. during a docker image build).
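A minimal sketch of the idea, not DeepSpeed's actual op_builder code: decide at build time whether CUDA ops can be compiled without calling `torch.cuda.is_available()`, which touches the driver and fails on GPU-less build machines. The `FORCE_CUDA_BUILD` flag below is hypothetical, added only to illustrate an override path for docker builds.

```python
import os
import torch


def cuda_toolkit_present() -> bool:
    # torch.version.cuda only reflects the CUDA toolkit PyTorch was built
    # with; it never initializes the driver, so it is safe on machines
    # without a GPU (e.g. inside a docker image build).
    return torch.version.cuda is not None


def should_build_cuda_ops() -> bool:
    # Hypothetical override flag so cross-compilation can force CUDA ops
    # on even when no GPU is visible at build time.
    if os.environ.get("FORCE_CUDA_BUILD", "0") == "1":
        return True
    return cuda_toolkit_present()


if __name__ == "__main__":
    print("build CUDA ops:", should_build_cuda_ops())
```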
Currently, if we load models with different dtypes in the same process, we get an error like ``` File ~/.miniconda3/lib/python3.8/site-packages/vllm/_custom_ops.py:89, in rotary_embedding(positions, query, key, head_size, cos_sin_cache, is_neox) 81 def...
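A minimal reproduction sketch, assuming two small models can be loaded in one process; the model name is a placeholder. Creating a second `LLM` with a different dtype in the same process is what surfaces the dtype mismatch inside the custom `rotary_embedding` op from the traceback above.

```python
from vllm import LLM

# First engine with fp16 weights.
llm_fp16 = LLM(model="facebook/opt-125m", dtype="float16")
print(llm_fp16.generate("Hello, my name is"))

# Second engine in the same process, but with bfloat16 weights.
llm_bf16 = LLM(model="facebook/opt-125m", dtype="bfloat16")
print(llm_bf16.generate("Hello, my name is"))  # the reported error surfaces here
```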
Some models overflow when using fp16 inference (e.g. Deepseek-V2), so we should add bfloat16 support to the quantization kernels. This PR adds bfloat16 support for the **gptq marlin kernel**. Unlike gptq...
Some models overflow when using fp16 inference (e.g. Deepseek-V2), so we should add bfloat16 support to the quantization kernels. This PR adds bfloat16 support for the gptq kernel. Related issue: https://github.com/vllm-project/vllm/issues/2149...
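A usage sketch for the two bfloat16 PRs above (the checkpoint name is a placeholder): with bfloat16 support in the gptq and gptq marlin kernels, a quantized checkpoint can be served with `dtype="bfloat16"` to avoid the fp16 overflow mentioned above.

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Llama-2-7B-GPTQ",  # placeholder GPTQ checkpoint
    quantization="gptq",               # or "gptq_marlin" for the marlin kernel
    dtype="bfloat16",                  # previously only fp16 was supported
)
params = SamplingParams(temperature=0.0, max_tokens=32)
print(llm.generate(["The capital of France is"], params)[0].outputs[0].text)
```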
### Your current environment ```text The output of `python collect_env.py` ``` ### 🐛 Describe the bug vLLM loads LoRA checkpoints when executing the model https://github.com/vllm-project/vllm/blob/v0.4.2/vllm/worker/model_runner.py#L789-L790 https://github.com/vllm-project/vllm/blob/v0.4.2/vllm/lora/worker_manager.py#L138-L172 Then when we get an...
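For context on the code path linked above, a sketch of how a LoRA adapter is attached per request in vLLM 0.4.x (model name and adapter path are placeholders): the worker manager loads the LoRA checkpoint lazily when the model is executed, which is where the report above points.

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_lora=True)

# LoRARequest(name, int_id, local_path); the path is a placeholder.
lora = LoRARequest("my-adapter", 1, "/path/to/lora_checkpoint")

out = llm.generate(
    ["Write a haiku about GPUs."],
    SamplingParams(max_tokens=32),
    lora_request=lora,  # adapter weights are loaded during model execution
)
print(out[0].outputs[0].text)
```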
https://github.com/vllm-project/vllm/pull/12185 added the triton moe wna16 kernel, but triton cannot reach the best performance when m is small. This PR adds the moe wna16 cuda kernel. It has better generation speed...
The performance of the gptq/awq marlin kernel is low when n is small. The reason is that when n is small, `barrier_acquire` and `barrier_release` cost too much time. The case...
https://github.com/vllm-project/vllm/pull/12185 and https://github.com/vllm-project/vllm/pull/13321 introduced the triton/cuda moe wna16 kernels to optimize the performance of moe gptq/awq. However, the best-performing gptq/awq kernel currently is the marlin kernel, and I hope to...
This PR optimizes the dense marlin kernel and the moe marlin kernel. Summary: - **(dense marlin only)** Migrate the optimization method introduced for the moe marlin kernel in https://github.com/vllm-project/vllm/pull/14447 to the dense...