chenqianfzh
QLoRA (https://arxiv.org/abs/2305.14314) cuts memory consumption when loading LLM weights without degrading performance. The weights of the base model, which are quantized to 4-bit, are paired with a low-rank but...
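As a rough illustration of that recipe, here is a minimal sketch using the Hugging Face transformers + peft stack; the model name, target modules, and LoRA hyperparameters are illustrative assumptions, not values taken from this work.

```python
# Minimal QLoRA-style setup: 4-bit (NF4) quantized base weights plus a
# trainable low-rank adapter. Model name and hyperparameters are illustrative.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize base weights to 4 bit
    bnb_4bit_quant_type="nf4",              # NormalFloat4 data type from the paper
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,  # dtype used for de-quantized compute
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)

lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)  # only the low-rank adapters are trainable
```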
### System Info
Linux 20.04

### Reproduction
`output` is defined as a tensor. The following works as expected: `output = matmul_4bit(a, b)`, but the following does not; the elements in...
This PR adds tensor parallelism to bitsandbytes quantization, which was introduced to vLLM in https://github.com/vllm-project/vllm/pull/4776.
We are developing the "conditional cpu-offload-weight" feature for vLLM, which is comparable to Hugging Face Accelerate's device_map='auto'. This democratizes access to vLLM, empowering a broader community of learners and researchers...
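For comparison, this is roughly what the Accelerate-style auto device mapping looks like through transformers; the model name and memory budget are illustrative.

```python
# device_map="auto" places layers across available GPUs and offloads the
# remainder to CPU when GPU memory runs out. Values below are illustrative.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-13b-hf",
    device_map="auto",
    max_memory={0: "16GiB", "cpu": "64GiB"},  # cap GPU usage, spill the rest to CPU
)
```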
This PR adds tensor parallelism to bitsandbytes quantization. It is verified on Llama 2 and Llama 3 models that the generated texts are identical to those produced without TP.
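A minimal usage sketch of the combination, assuming vLLM's offline Python API; the model name and TP degree are placeholders.

```python
# bitsandbytes-quantized model served with tensor parallelism.
# quantization/load_format follow vLLM's bitsandbytes support from PR #4776;
# the model name and tensor_parallel_size are illustrative.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-2-7b-hf",
    quantization="bitsandbytes",
    load_format="bitsandbytes",
    tensor_parallel_size=2,  # shard the 4-bit weights across 2 GPUs
)
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```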
### System Info
Linux

### Reproduction
I hit this bug during my coding work and verified it with the following script. With bitsandbytes 0.43.3, it shows that half of the...
This PR allows the vLLM LMCache connector to store/retrieve hidden_states in PD disaggregation, so the first iteration on the consumer side does not need to be re-computed. End-to-end verified with Llama as...
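For context, PD (prefill/decode) disaggregation splits serving into a producer instance that runs prefill and a consumer instance that runs decode, exchanging KV cache (and, with this PR, hidden_states) through the LMCache connector. The sketch below is loosely based on vLLM's disaggregated-prefill examples; the KVTransferConfig field names and connector string are assumptions and may differ across versions.

```python
# Rough sketch of a producer/consumer pair wired through the LMCache connector.
# Config field names follow vLLM's disaggregated-prefill examples (assumed here).
from vllm import LLM
from vllm.config import KVTransferConfig

# Prefill (producer) side: computes KV cache + hidden states and pushes them out.
producer = LLM(
    model="meta-llama/Llama-2-7b-hf",
    kv_transfer_config=KVTransferConfig.from_cli(
        '{"kv_connector":"LMCacheConnector","kv_role":"kv_producer",'
        '"kv_rank":0,"kv_parallel_size":2}'),
)

# Decode (consumer) side: retrieves them instead of re-computing the first iteration.
consumer = LLM(
    model="meta-llama/Llama-2-7b-hf",
    kv_transfer_config=KVTransferConfig.from_cli(
        '{"kv_connector":"LMCacheConnector","kv_role":"kv_consumer",'
        '"kv_rank":1,"kv_parallel_size":2}'),
)
```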