chenqianfzh
QLoRA (https://arxiv.org/abs/2305.14314) cuts memory consumption when loading LLM weights without degrading performance. The weights of the base model, which are quantized to 4-bit, are paired with a low-rank but...
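As a rough illustration of that recipe, here is a minimal sketch using the Hugging Face transformers + peft stack; the model name, target modules, and LoRA hyperparameters are illustrative assumptions, not values taken from this work.

```python
# Minimal QLoRA-style setup: 4-bit (NF4) quantized base weights plus a
# trainable low-rank adapter. Model name and hyperparameters are illustrative.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize base weights to 4 bit
    bnb_4bit_quant_type="nf4",              # NormalFloat4 data type from the paper
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,  # dtype used for de-quantized compute
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)

lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)  # only the low-rank adapters are trainable
```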
### System Info
Linux 20.04

### Reproduction
`output` is defined as a tensor. The following works as expected: `output = matmul_4bit(a, b)`, but the following does not; the elements in...
This PR adds tensor parallelism to bitsandbytes quantization, which was introduced to vLLM in https://github.com/vllm-project/vllm/pull/4776.
We are developing the "conditional cpu-offload-weight" feature for vLLM, which is comparable to Hugging Face Accelerate's device_map='auto'. This democratizes access to vLLM, empowering a broader community of learners and researchers...
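For comparison, this is roughly what the Accelerate-style auto device mapping looks like through transformers; the model name and memory budget are illustrative.

```python
# device_map="auto" places layers across available GPUs and offloads the
# remainder to CPU when GPU memory runs out. Values below are illustrative.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-13b-hf",
    device_map="auto",
    max_memory={0: "16GiB", "cpu": "64GiB"},  # cap GPU usage, spill the rest to CPU
)
```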
This PR adds tensor parallelism to bitsandbytes quantization. It is verified on Llama 2 and Llama 3 models that the generated texts are identical to those produced without TP.
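A minimal usage sketch of the combination, assuming vLLM's offline Python API; the model name and TP degree are placeholders.

```python
# bitsandbytes-quantized model served with tensor parallelism.
# quantization/load_format follow vLLM's bitsandbytes support from PR #4776;
# the model name and tensor_parallel_size are illustrative.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-2-7b-hf",
    quantization="bitsandbytes",
    load_format="bitsandbytes",
    tensor_parallel_size=2,  # shard the 4-bit weights across 2 GPUs
)
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```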
### System Info
Linux

### Reproduction
I hit this bug during my coding work and verified it with the following script. With bitsandbytes 0.43.3, it shows that half of the...
This PR allows the vLLM LMCache connector to store/retrieve hidden_states in PD disaggregation, so the first iteration on the consumer side does not need to be re-computed. End-to-end verified with Llama as...
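For context, PD (prefill/decode) disaggregation splits serving into a producer instance that runs prefill and a consumer instance that runs decode, exchanging KV cache (and, with this PR, hidden_states) through the LMCache connector. The sketch below is loosely based on vLLM's disaggregated-prefill examples; the KVTransferConfig field names and connector string are assumptions and may differ across versions.

```python
# Rough sketch of a producer/consumer pair wired through the LMCache connector.
# Config field names follow vLLM's disaggregated-prefill examples (assumed here).
from vllm import LLM
from vllm.config import KVTransferConfig

# Prefill (producer) side: computes KV cache + hidden states and pushes them out.
producer = LLM(
    model="meta-llama/Llama-2-7b-hf",
    kv_transfer_config=KVTransferConfig.from_cli(
        '{"kv_connector":"LMCacheConnector","kv_role":"kv_producer",'
        '"kv_rank":0,"kv_parallel_size":2}'),
)

# Decode (consumer) side: retrieves them instead of re-computing the first iteration.
consumer = LLM(
    model="meta-llama/Llama-2-7b-hf",
    kv_transfer_config=KVTransferConfig.from_cli(
        '{"kv_connector":"LMCacheConnector","kv_role":"kv_consumer",'
        '"kv_rank":1,"kv_parallel_size":2}'),
)
```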