mzchtx
> Hi @mzchtx. What changes are you proposing precisely?
>
> `k, v` should already be sliced to the length of `input_pos` with https://github.com/Lightning-AI/lit-llama/blob/main/lit_llama/model.py#L217-L218 — `input_copy` does not change the shape...
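To make the slicing behavior under discussion concrete, here is a minimal PyTorch sketch of the general pattern (the names `update_kv_cache`, `k_cache`, and the shapes are illustrative assumptions, not the repo's actual code): new keys are written into a preallocated cache at `input_pos`, and the cache is then sliced up to the last filled position so attention only sees valid entries.

```python
import torch

def update_kv_cache(k_cache: torch.Tensor, k_new: torch.Tensor,
                    input_pos: torch.Tensor) -> torch.Tensor:
    # k_cache: (T_max, head_dim); k_new: (len(input_pos), head_dim).
    # Write the new positions into the preallocated cache in place.
    k_cache.index_copy_(0, input_pos, k_new)
    # Slice to the length implied by the last position, so downstream
    # attention never reads uninitialized cache rows.
    return k_cache[: int(input_pos[-1]) + 1]

# Example: prefill 3 positions, then decode one more token.
cache = torch.zeros(8, 4)
k = update_kv_cache(cache, torch.ones(3, 4), torch.arange(3))
k = update_kv_cache(cache, torch.full((1, 4), 2.0), torch.tensor([3]))
```

Note that the in-place `index_copy_` leaves the cache tensor's shape unchanged; only the returned view is sliced to the length of the filled prefix.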
> @mzchtx Would you like to open a PR with your suggested changes?

I tested more data and ran a more detailed comparison:

1. The current implementation is more efficient...
> From playing with this, the generated outputs are not the same, meaning that this is not numerically equivalent. However, it's hard to tell if they are worse or just...
> @mzchtx Did you measure the performance difference?
>
> Would you like to open a PR with your suggestion?

Yes. However, this primarily impacts inference performance. During our testing,...
> Same issue. I believe this is a known issue with llm.int8(); see [A Gentle Introduction to 8-bit Matrix Multiplication for transformers at scale using Hugging Face Transformers,...