Ramil Nugmanov: 2 comments on "Results"
The cache can be preallocated, but iteratively computing attention of Q against all max_seq tokens in K, masking afterwards, and then doing the matmul with all max_seq rows of V is not efficient.
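To illustrate the point, here is a minimal PyTorch sketch (not any library's actual kernel) contrasting attention computed over the full preallocated cache with masking versus attention over only the filled prefix; the function and variable names are illustrative:

```python
import torch

def attn_masked_full(q, k_cache, v_cache, cur_len):
    # Attend to the *entire* preallocated cache (max_seq entries) and mask
    # out the unfilled positions afterwards: wasted FLOPs and memory traffic.
    scores = q @ k_cache.transpose(-1, -2) / q.shape[-1] ** 0.5      # (..., 1, max_seq)
    pad = torch.arange(k_cache.shape[-2], device=q.device) >= cur_len
    scores = scores.masked_fill(pad, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v_cache                   # matmul with max_seq rows of V

def attn_sliced(q, k_cache, v_cache, cur_len):
    # Attend only to the filled prefix of the cache: same result,
    # far less work when cur_len << max_seq.
    k, v = k_cache[..., :cur_len, :], v_cache[..., :cur_len, :]
    scores = q @ k.transpose(-1, -2) / q.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v
```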
Sure, I launched generation with the default prompt "Tell me a joke?" and max new tokens "300" on llama3.1-8B-Ins. I reduced max_seq_len to 64K (the default value doesn't fit into an A100/80GB)...
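For reference, a minimal sketch of such a run assuming a Hugging Face transformers setup; the model id and the mechanism for capping the preallocated cache length are assumptions, not the exact code used above:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # assumed model id
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="cuda"
)

messages = [{"role": "user", "content": "Tell me a joke?"}]
inputs = tok.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# max_new_tokens=300 as in the run described above; with a statically
# preallocated KV cache, its length would be capped at 64K instead of the
# model's default 128K context to fit on a single A100/80GB.
out = model.generate(inputs, max_new_tokens=300, do_sample=False)
print(tok.decode(out[0, inputs.shape[-1]:], skip_special_tokens=True))
```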