mzchtx
> Hi @mzchtx. What changes are you proposing precisely?
>
> `k, v` should already be sliced to the length of `input_pos` with https://github.com/Lightning-AI/lit-llama/blob/main/lit_llama/model.py#L217-L218 — `input_copy` does not change the shape...
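To make the slicing behavior under discussion concrete, here is a minimal PyTorch sketch of the general pattern (the names `update_kv_cache`, `k_cache`, and the shapes are illustrative assumptions, not the repo's actual code): new keys are written into a preallocated cache at `input_pos`, and the cache is then sliced up to the last filled position so attention only sees valid entries.

```python
import torch

def update_kv_cache(k_cache: torch.Tensor, k_new: torch.Tensor,
                    input_pos: torch.Tensor) -> torch.Tensor:
    # k_cache: (T_max, head_dim); k_new: (len(input_pos), head_dim).
    # Write the new positions into the preallocated cache in place.
    k_cache.index_copy_(0, input_pos, k_new)
    # Slice to the length implied by the last position, so downstream
    # attention never reads uninitialized cache rows.
    return k_cache[: int(input_pos[-1]) + 1]

# Example: prefill 3 positions, then decode one more token.
cache = torch.zeros(8, 4)
k = update_kv_cache(cache, torch.ones(3, 4), torch.arange(3))
k = update_kv_cache(cache, torch.full((1, 4), 2.0), torch.tensor([3]))
```

Note that the in-place `index_copy_` leaves the cache tensor's shape unchanged; only the returned view is sliced to the length of the filled prefix.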
> @mzchtx Would you like to open a PR with your suggested changes?

I tested more data and ran a more detailed comparison:

1. The current implementation is more efficient...
> From playing with this, the generated outputs are not the same, meaning that this is not numerically equivalent. However, it's hard to tell if they are worse or just...
> @mzchtx Did you measure the performance difference?
>
> Would you like to open a PR with your suggestion?

Yes. However, this primarily impacts inference performance. During our testing,...
> Same issue. I believe this is a known issue with llm.int8(); see [A Gentle Introduction to 8-bit Matrix Multiplication for transformers at scale using Hugging Face Transformers,...