Ramil Nugmanov


The cache can be preallocated, but iteratively computing Q's attention against all max_seq_len tokens of K, then masking and doing the matmul with all max_seq_len rows of V, is not efficient.
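
A minimal sketch of the point above (hypothetical shapes and variable names, assuming a PyTorch-style preallocated cache): attending over the full preallocated cache with a mask does work proportional to max_seq_len, while slicing the cache to the filled prefix only pays for the tokens generated so far.

```python
import torch
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"
batch, n_heads, head_dim = 1, 32, 128
max_seq_len = 8192   # preallocated cache length (kept small here; e.g. 64K in practice)
cur_len = 1024       # positions actually filled so far

# Preallocated K/V cache; only the first cur_len positions hold real data.
k_cache = torch.zeros(batch, n_heads, max_seq_len, head_dim, dtype=torch.bfloat16, device=device)
v_cache = torch.zeros_like(k_cache)
k_cache[:, :, :cur_len] = torch.randn(batch, n_heads, cur_len, head_dim, dtype=torch.bfloat16, device=device)
v_cache[:, :, :cur_len] = torch.randn(batch, n_heads, cur_len, head_dim, dtype=torch.bfloat16, device=device)
q = torch.randn(batch, n_heads, 1, head_dim, dtype=torch.bfloat16, device=device)  # one decode step

# Variant 1 (the inefficient path): score Q against all max_seq_len keys,
# mask out the unfilled tail, then matmul the weights with all max_seq_len value rows.
mask = torch.zeros(1, 1, 1, max_seq_len, dtype=torch.bfloat16, device=device)
mask[..., cur_len:] = float("-inf")
out_masked = F.scaled_dot_product_attention(q, k_cache, v_cache, attn_mask=mask)

# Variant 2: slice the cache to the filled prefix so the per-step cost tracks cur_len.
out_sliced = F.scaled_dot_product_attention(q, k_cache[:, :, :cur_len], v_cache[:, :, :cur_len])

assert torch.allclose(out_masked, out_sliced, atol=1e-2, rtol=1e-2)
```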

Sure, I launched generation with the default prompt "Tell me a joke?" and max new tokens set to 300 on llama3.1-8B-Ins. I reduced max_seq_len to 64K (the default value doesn't fit into an A100/80GB)...
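
For context, a rough back-of-the-envelope for how a preallocated KV-cache footprint scales with max_seq_len. The config values below (32 layers, 8 KV heads, head_dim 128, bf16) are assumptions based on a Llama-3.1-8B-style model; the real footprint depends on how the implementation lays out the cache (GQA KV heads vs. per-query-head K/V, dtype, batch size).

```python
# Estimate the preallocated KV-cache size in bytes under the assumed layout:
# K and V tensors per layer, each of shape (batch, n_kv_heads, max_seq_len, head_dim).
def kv_cache_bytes(max_seq_len, n_layers=32, n_kv_heads=8, head_dim=128,
                   batch=1, bytes_per_elem=2):
    return 2 * n_layers * batch * n_kv_heads * max_seq_len * head_dim * bytes_per_elem

for seq in (131072, 65536):  # 128K default context vs. the reduced 64K
    print(f"max_seq_len={seq}: {kv_cache_bytes(seq) / 2**30:.1f} GiB (GQA, 8 KV heads), "
          f"{kv_cache_bytes(seq, n_kv_heads=32) / 2**30:.1f} GiB if K/V are kept per query head")
```

This is only an illustration of why shrinking max_seq_len matters when the cache is preallocated up front, not a claim about the specific implementation discussed here.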