
Code for QuaRot, an end-to-end 4-bit inference of large language models.

16 QuaRot issues

`kv_indicies` -> `kv_indices`

Hi! Thanks for the great work! Been playing with the code today and trying to reproduce Figure 1 in the paper, and here's what I got. ![image](https://github.com/spcl/QuaRot/assets/50691954/f254ed8f-9173-419b-a60a-5abee3d68a9f) I noticed that...

SpinQuant is a follow-up work to QuaRot. However, we have noticed that the two papers define the rotation matrix pairing differently. In QuaRot, first, there is...
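For context, the computational invariance both papers rely on works like this: an orthogonal rotation fused into one weight matrix and its transpose fused into the adjacent one leaves the end-to-end function unchanged, and which side each factor attaches to is exactly the pairing convention that can differ between papers. A minimal NumPy sketch (dimensions and the fusion sides shown here are illustrative assumptions, not either paper's exact convention):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64
W1, W2 = rng.standard_normal((d, d)), rng.standard_normal((d, d))
x = rng.standard_normal(d)

# random orthogonal rotation via QR of a Gaussian matrix
R, _ = np.linalg.qr(rng.standard_normal((d, d)))

# fuse R^T into the producing weight and R into the consuming one;
# since R @ R.T = I, the end-to-end function is unchanged while the
# hidden activations in between are rotated
W1_rot = R.T @ W1
W2_rot = W2 @ R

ref = W2 @ (W1 @ x)
out = W2_rot @ (W1_rot @ x)
assert np.allclose(ref, out)
```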

I encountered an unexpected precision loss while using QuaRot. I ran comparison experiments on LLaMA-2-7b: performing w4a16 RTN quantization on the model gave a perplexity (PPL) of 7.354664. Performing...
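For readers unfamiliar with the w4a16 RTN baseline mentioned above: round-to-nearest quantizes weights to 4 bits group-wise while activations stay in fp16. A minimal NumPy sketch of that general idea (the group size and asymmetric zero-point here are assumptions, not the repository's exact configuration):

```python
import numpy as np

def rtn_quantize_w4(weight, group_size=128):
    """Round-to-nearest (RTN) 4-bit asymmetric weight quantization.

    Each group of `group_size` values is quantized independently to
    integers in [0, 15], then dequantized back to float to simulate a
    w4a16 forward pass (4-bit weights, 16-bit activations).
    """
    w = weight.reshape(-1, group_size)
    w_min = w.min(axis=1, keepdims=True)
    w_max = w.max(axis=1, keepdims=True)
    scale = (w_max - w_min) / 15.0
    scale = np.where(scale == 0, 1.0, scale)       # avoid divide-by-zero
    zero = np.round(-w_min / scale)
    q = np.clip(np.round(w / scale + zero), 0, 15)
    dq = (q - zero) * scale                        # dequantized weights
    return dq.reshape(weight.shape)

rng = np.random.default_rng(0)
w = rng.standard_normal((256, 256)).astype(np.float32)
w_q = rtn_quantize_w4(w)
print(float(np.abs(w - w_q).max()))  # error bounded by about half the group scale
```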

Same idea as [here](https://github.com/microsoft/TransformerCompression/blob/6b12cdee6ad51791d7c776b3a046bc408b9e77e9/src/slicegpt/layernorm_fusion.py#L83-L85). opt-125m is impacted by this.
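The linked SliceGPT code folds LayerNorm's affine parameters into the adjacent linear layer. A minimal PyTorch sketch of that general idea (the function name and exact handling are illustrative assumptions, not the linked implementation):

```python
import torch

def fuse_ln_into_linear(ln: torch.nn.LayerNorm, linear: torch.nn.Linear):
    """Fold LayerNorm's affine parameters (gamma, beta) into the next
    linear layer, leaving a parameter-free normalization behind:

        W @ (gamma * x_norm + beta) + b  ==  (W * gamma) @ x_norm + (W @ beta + b)
    """
    W = linear.weight.data
    linear.weight.data = W * ln.weight.data        # scale columns by gamma
    if ln.bias is not None:
        extra = W @ ln.bias.data                   # fold beta into the bias
        if linear.bias is None:
            linear.bias = torch.nn.Parameter(extra)
        else:
            linear.bias.data += extra
    # neutralize LayerNorm's affine part
    ln.weight.data.fill_(1.0)
    if ln.bias is not None:
        ln.bias.data.zero_()

torch.manual_seed(0)
ln = torch.nn.LayerNorm(8)
ln.weight.data.uniform_(0.5, 1.5)
ln.bias.data.uniform_(-0.1, 0.1)
lin = torch.nn.Linear(8, 4)
x = torch.randn(3, 8)
ref = lin(ln(x))
fuse_ln_into_linear(ln, lin)
out = lin(ln(x))
print(torch.allclose(ref, out, atol=1e-5))  # True
```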

I've noticed QuaRot and other KV cache papers include perplexity, but it is unclear to me how a quantized KV cache is used during perplexity calculation. Do you have a...
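One common convention (an assumption on my part, not necessarily what QuaRot does) is to fake-quantize K/V on write and dequantize on read during teacher-forced evaluation, while queries, attention, and the cross-entropy over next tokens stay in full precision. A minimal PyTorch sketch:

```python
import torch
import torch.nn.functional as F

def fake_quant_int4(t, dim=-1):
    """Simulate 4-bit asymmetric quantize->dequantize along `dim`,
    as if the KV cache were stored in 4 bits and read back."""
    t_min = t.amin(dim=dim, keepdim=True)
    t_max = t.amax(dim=dim, keepdim=True)
    scale = (t_max - t_min).clamp(min=1e-8) / 15.0
    zero = torch.round(-t_min / scale)
    q = torch.clamp(torch.round(t / scale + zero), 0, 15)
    return (q - zero) * scale

torch.manual_seed(0)
q, k, v = (torch.randn(1, 8, 128, 64) for _ in range(3))  # (batch, heads, seq, head_dim)
k_q, v_q = fake_quant_int4(k), fake_quant_int4(v)

# teacher-forced step: every position attends causally to the
# fake-quantized cache; the resulting hidden states feed the usual
# full-precision cross-entropy used for perplexity
out = F.scaled_dot_product_attention(q, k_q, v_q, is_causal=True)
```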

![image](https://github.com/user-attachments/assets/8a6bc6db-8fd5-4e02-b4a5-4cc6c8c28f1a) Thanks for the wonderful work; however, I've encountered a problem with the code implementation as described in the Introduction of your paper....

Description: I am seeing a significant precision drop when using the QuaRot algorithm on a device limited to float32 computation. Originally designed for double precision, the rotations are cast to...
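The float32 sensitivity can be reproduced in isolation: the sketch below builds an orthonormal Hadamard rotation in float64, then round-trips a vector through it in double vs. single precision (the size is illustrative, and this is a standalone reproduction, not the repository's code path):

```python
import numpy as np

def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix (n a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

n = 1024
Q = hadamard(n) / np.sqrt(n)              # orthonormal rotation, float64
x = np.random.default_rng(0).standard_normal(n)

# round-trip x -> Qx -> Q^T(Qx) in double vs. single precision
rt64 = Q.T @ (Q @ x)
Q32, x32 = Q.astype(np.float32), x.astype(np.float32)
rt32 = (Q32.T @ (Q32 @ x32)).astype(np.float64)

err64 = np.abs(rt64 - x).max()            # at float64 rounding level
err32 = np.abs(rt32 - x).max()            # typically orders of magnitude larger
print(err64, err32)
```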

The main modifications to support Llama 3.1 and 3.2: - In the case of Llama 3.2, `tie_word_embedding=True`, so we need to apply the rotation only once, on the input embedding, as...
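The tied-weight point above can be demonstrated in isolation: with a shared embedding/unembedding table, one orthogonal rotation of the shared tensor rotates both ends, and the logits are unchanged. A minimal PyTorch sketch (dimensions illustrative):

```python
import torch

torch.manual_seed(0)
d_model, vocab = 32, 100
emb = torch.nn.Embedding(vocab, d_model)
lm_head = torch.nn.Linear(d_model, vocab, bias=False)
lm_head.weight = emb.weight                      # tie_word_embeddings=True

tok = torch.tensor([3])
logits_ref = lm_head(emb(tok))

# orthogonal rotation; because the parameter is shared, a single
# in-place update rotates both the embedding and the unembedding,
# and the rotations cancel in the logits
Qm, _ = torch.linalg.qr(torch.randn(d_model, d_model, dtype=torch.float64))
with torch.no_grad():
    emb.weight.copy_(emb.weight.double() @ Qm)

logits_rot = lm_head(emb(tok))
assert lm_head.weight.data_ptr() == emb.weight.data_ptr()  # still tied
assert torch.allclose(logits_ref, logits_rot, atol=1e-4)
```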