QuaRot
Code for QuaRot, an end-to-end 4-bit inference scheme for large language models.
kv_indicies -> kv_indices
Hi! Thanks for the great work! Been playing with the code today and trying to reproduce Figure 1 in the paper, and here's what I got.  I noticed that...
SpinQuant is a follow-up work to QuaRot. However, we have noticed that the two papers pair the rotation matrices differently. In QuaRot, first, there is...
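For readers following this thread: both papers rely on the same computational-invariance identity, and the disagreement is about which factor of the rotation is paired with which matrix. A minimal sketch of the identity itself (neither paper's specific pairing convention):

```python
import torch

# Computational-invariance check underlying QuaRot-style rotations:
# for an orthogonal Q, (W @ Q) @ (Q.T @ x) == W @ x, so Q can be fused
# into the weights and Q.T into the incoming activations without
# changing the layer's output. Which factor (Q vs. Q.T) attaches to
# which side is exactly the pairing detail that can differ between papers.
torch.manual_seed(0)
d = 8
W = torch.randn(d, d, dtype=torch.float64)
x = torch.randn(d, dtype=torch.float64)
Q, _ = torch.linalg.qr(torch.randn(d, d, dtype=torch.float64))  # random orthogonal matrix
assert torch.allclose((W @ Q) @ (Q.T @ x), W @ x, atol=1e-10)
```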
I encountered an unexpected precision loss while using QuaRot. I conducted comparison experiments on LLaMA-2-7b: performing w4a16 RTN quantization on the model resulted in a perplexity (PPL) of 7.354664. Performing...
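For context, a minimal sketch of what per-channel symmetric w4a16 RTN fake quantization typically looks like; this is an illustration, not the repo's exact quantizer (clipping ratio, group size, and zero-point handling all shift PPL):

```python
import torch

def rtn_fake_quant_w4(weight: torch.Tensor) -> torch.Tensor:
    """Per-output-channel symmetric 4-bit round-to-nearest (RTN).

    Returns fake-quantized weights (quantize then dequantize), so the
    result stays in the original dtype; the "a16" side of w4a16 keeps
    activations in 16-bit and needs no change here.
    """
    scale = weight.abs().amax(dim=1, keepdim=True) / 7.0  # int4 range [-8, 7]
    scale = scale.clamp(min=1e-8)
    q = torch.clamp(torch.round(weight / scale), -8, 7)
    return q * scale
```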
Same idea as [here](https://github.com/microsoft/TransformerCompression/blob/6b12cdee6ad51791d7c776b3a046bc408b9e77e9/src/slicegpt/layernorm_fusion.py#L83-L85). opt-125m is impacted by this.
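For anyone landing here without following the link: the referenced SliceGPT code folds the LayerNorm affine parameters into the following linear layer, which matters for OPT models because their LayerNorms carry non-trivial weights and biases. A minimal sketch of that folding (hypothetical helper name, biases handled in the simplest way):

```python
import torch

def fuse_ln_into_linear(ln: torch.nn.LayerNorm, linear: torch.nn.Linear) -> None:
    """Fold LayerNorm's affine weight/bias into the next linear layer.

    After fusion the LayerNorm is left with weight=1, bias=0 (pure
    normalization); the linear layer absorbs the scale and shift.
    """
    W = linear.weight.data.double()                       # (out, in)
    linear.weight.data = (W * ln.weight.data.double()).to(linear.weight.dtype)
    if ln.bias is not None:
        shift = W @ ln.bias.data.double()                 # (out,)
        if linear.bias is None:
            linear.bias = torch.nn.Parameter(
                torch.zeros(linear.out_features,
                            dtype=linear.weight.dtype,
                            device=linear.weight.device))
        linear.bias.data = (linear.bias.data.double() + shift).to(linear.bias.dtype)
        ln.bias.data.zero_()
    ln.weight.data.fill_(1.0)
```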
I've noticed QuaRot and other KV cache papers include perplexity, but it is unclear to me how a quantized KV cache is used during perplexity calculation. Do you have a...
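Not the authors, but the usual answer in this line of work: perplexity is computed teacher-forced over full sequences, and KV-cache quantization is simulated by fake-quantizing K and V right after their projections, before attention, so the quantization error still flows through the attention scores. A minimal sketch of such a fake quantizer (per-token asymmetric; real setups vary bits and group size):

```python
import torch

def fake_quant_kv(x: torch.Tensor, bits: int = 4) -> torch.Tensor:
    """Per-token asymmetric fake quantization along the head dimension.

    Applied to K and V after projection, so a quantized KV cache is
    emulated even in a single teacher-forced forward pass.
    """
    qmax = 2 ** bits - 1
    lo = x.amin(dim=-1, keepdim=True)
    hi = x.amax(dim=-1, keepdim=True)
    scale = ((hi - lo) / qmax).clamp(min=1e-8)
    q = torch.clamp(torch.round((x - lo) / scale), 0, qmax)
    return q * scale + lo
```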
Thanks for the wonderful work! However, I've encountered a problem with the code implementation as described in the Introduction of your paper....
Description: I am experiencing a significant precision drop when using the QuaRot algorithm on a device limited to float32 computation. The rotations, originally designed for double precision, are cast to...
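A minimal sketch of the pattern being described, with an illustrative function name: the rotation matmul runs in float64 and the result is cast back, so a float32-only device silently loses that headroom:

```python
import torch

def rotate_weight(weight: torch.Tensor, Q: torch.Tensor) -> torch.Tensor:
    # Apply an orthogonal rotation in float64 to limit rounding error,
    # then cast back to the original dtype. On a float32-only device the
    # .double() upcast is unavailable, so the matmul accumulates in
    # float32, which is the precision drop reported above.
    return (weight.double() @ Q.double()).to(weight.dtype)
```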
The main modifications to support Llama 3.1 and 3.2: - In the case of Llama 3.2, `tie_word_embeddings=True`, so we only need to apply the rotation once, on the input embedding, as...
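A minimal sketch of why one rotation suffices when embeddings are tied (attribute paths assume a Hugging Face Llama layout):

```python
import torch

def rotate_tied_embedding(model, Q: torch.Tensor) -> None:
    # With tie_word_embeddings=True, lm_head.weight shares storage with
    # embed_tokens.weight, so rotating the embedding once rotates both:
    # logits = (x @ Q) @ (W_E @ Q).T = x @ W_E.T, leaving outputs unchanged.
    W = model.model.embed_tokens.weight.data
    model.model.embed_tokens.weight.data = (W.double() @ Q.double()).to(W.dtype)
```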