Ramil Nugmanov


The cache can be preallocated, but iteratively computing Q's attention against all max_seq_len tokens of K, then masking and doing the matmul with all max_seq_len rows of V, is not efficient.
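
A minimal sketch of the point above (hypothetical shapes and variable names, assuming a PyTorch-style preallocated cache): attending over the full preallocated cache with a mask does work proportional to max_seq_len, while slicing the cache to the filled prefix only pays for the tokens generated so far.

```python
import torch
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"
batch, n_heads, head_dim = 1, 32, 128
max_seq_len = 8192   # preallocated cache length (kept small here; e.g. 64K in practice)
cur_len = 1024       # positions actually filled so far

# Preallocated K/V cache; only the first cur_len positions hold real data.
k_cache = torch.zeros(batch, n_heads, max_seq_len, head_dim, dtype=torch.bfloat16, device=device)
v_cache = torch.zeros_like(k_cache)
k_cache[:, :, :cur_len] = torch.randn(batch, n_heads, cur_len, head_dim, dtype=torch.bfloat16, device=device)
v_cache[:, :, :cur_len] = torch.randn(batch, n_heads, cur_len, head_dim, dtype=torch.bfloat16, device=device)
q = torch.randn(batch, n_heads, 1, head_dim, dtype=torch.bfloat16, device=device)  # one decode step

# Variant 1 (the inefficient path): score Q against all max_seq_len keys,
# mask out the unfilled tail, then matmul the weights with all max_seq_len value rows.
mask = torch.zeros(1, 1, 1, max_seq_len, dtype=torch.bfloat16, device=device)
mask[..., cur_len:] = float("-inf")
out_masked = F.scaled_dot_product_attention(q, k_cache, v_cache, attn_mask=mask)

# Variant 2: slice the cache to the filled prefix so the per-step cost tracks cur_len.
out_sliced = F.scaled_dot_product_attention(q, k_cache[:, :, :cur_len], v_cache[:, :, :cur_len])

assert torch.allclose(out_masked, out_sliced, atol=1e-2, rtol=1e-2)
```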

Sure, I launched generation with the default prompt "Tell me a joke?" and max new tokens set to 300 on llama3.1-8B-Ins. I reduced max_seq_len to 64K (the default value doesn't fit into an A100/80GB)...
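
For context, a rough back-of-the-envelope for how a preallocated KV-cache footprint scales with max_seq_len. The config values below (32 layers, 8 KV heads, head_dim 128, bf16) are assumptions based on a Llama-3.1-8B-style model; the real footprint depends on how the implementation lays out the cache (GQA KV heads vs. per-query-head K/V, dtype, batch size).

```python
# Estimate the preallocated KV-cache size in bytes under the assumed layout:
# K and V tensors per layer, each of shape (batch, n_kv_heads, max_seq_len, head_dim).
def kv_cache_bytes(max_seq_len, n_layers=32, n_kv_heads=8, head_dim=128,
                   batch=1, bytes_per_elem=2):
    return 2 * n_layers * batch * n_kv_heads * max_seq_len * head_dim * bytes_per_elem

for seq in (131072, 65536):  # 128K default context vs. the reduced 64K
    print(f"max_seq_len={seq}: {kv_cache_bytes(seq) / 2**30:.1f} GiB (GQA, 8 KV heads), "
          f"{kv_cache_bytes(seq, n_kv_heads=32) / 2**30:.1f} GiB if K/V are kept per query head")
```

This is only an illustration of why shrinking max_seq_len matters when the cache is preallocated up front, not a claim about the specific implementation discussed here.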