Implement KV cache sparsity like H2O using attention scores
Feature request
Hello!
It is a bit like #26553, which implemented SinkCache. I would love to see a KV cache sparsity method like H2O implemented, as proposed in http://arxiv.org/abs/2306.14048.
The authors have released their code here: https://github.com/FMInference/H2O.
It could be used like this:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, H2O_Cache

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
inputs = tokenizer("Hello, my name is", return_tensors="pt")

cache = H2O_Cache(recent_length=512, HH_length=512)
gen_out = model.generate(**inputs, do_sample=False, max_new_tokens=3000, past_key_values=cache)
```
Motivation
As the H2O paper observes, a small portion of tokens (Heavy Hitters, or H2) contributes most of the value when computing attention scores. Based on this observation, H2O proposes a KV cache eviction policy that dynamically retains a balance of recent tokens and H2 tokens.
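To make the policy concrete, here is a minimal sketch of the selection step (not the authors' implementation): accumulate how much attention each cached position has received, always keep the most recent `recent_length` positions, and among the older positions keep the `HH_length` with the highest accumulated score. The function name is illustrative; the parameter names follow the proposed API above.

```python
import torch


def h2o_keep_indices(accumulated_scores: torch.Tensor, recent_length: int, HH_length: int) -> torch.Tensor:
    """Illustrative sketch: choose which KV cache positions to keep.

    accumulated_scores: shape (kv_len,), the total attention score each cached
    position has received so far (a proxy for how "heavy" a hitter it is).
    """
    device = accumulated_scores.device
    kv_len = accumulated_scores.shape[0]
    if kv_len <= recent_length + HH_length:
        return torch.arange(kv_len, device=device)  # cache still within budget, nothing to evict

    # Always keep the most recent `recent_length` positions.
    recent = torch.arange(kv_len - recent_length, kv_len, device=device)

    # Among the older positions, keep the `HH_length` heavy hitters.
    older_scores = accumulated_scores[: kv_len - recent_length]
    heavy_hitters = torch.topk(older_scores, k=HH_length).indices

    return torch.sort(torch.cat([heavy_hitters, recent])).values
```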
Your contribution
I would love to help implement this into transformers.
This would not only involve implementing an `H2O_Cache` in `src/transformers/cache_utils.py`, but also reordering some code in `LlamaAttention.forward` so that `Cache.update` can receive the attention scores, which other KV cache sparsity methods such as SnapKV (and future work) would also need.
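As a rough sketch of what that could look like, assuming the current `DynamicCache` API with per-layer `key_cache`/`value_cache` lists, and assuming the attention layer forwards its attention weights through `cache_kwargs` (all names here are illustrative; this reuses `h2o_keep_indices` from the sketch above):

```python
from typing import Any, Dict, Optional, Tuple

import torch
from transformers.cache_utils import DynamicCache


class H2O_Cache(DynamicCache):
    """Illustrative sketch of an H2O-style eviction cache (not an official API)."""

    def __init__(self, recent_length: int = 512, HH_length: int = 512):
        super().__init__()
        self.recent_length = recent_length
        self.HH_length = HH_length
        self.accumulated_scores: Dict[int, torch.Tensor] = {}

    def update(
        self,
        key_states: torch.Tensor,
        value_states: torch.Tensor,
        layer_idx: int,
        cache_kwargs: Optional[Dict[str, Any]] = None,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        keys, values = super().update(key_states, value_states, layer_idx, cache_kwargs)

        # Assumption: the attention layer forwards its attention weights
        # (batch, heads, q_len, kv_len) via `cache_kwargs["attention_scores"]`.
        attn_weights = (cache_kwargs or {}).get("attention_scores")
        if attn_weights is None:
            return keys, values

        # Accumulate the attention each cached position has received.
        scores = attn_weights.sum(dim=(0, 1, 2))
        prev = self.accumulated_scores.get(layer_idx)
        if prev is not None:
            scores = scores + torch.nn.functional.pad(prev, (0, scores.shape[0] - prev.shape[0]))

        # Evict down to `recent_length + HH_length` positions for this layer.
        keep = h2o_keep_indices(scores, self.recent_length, self.HH_length)
        self.key_cache[layer_idx] = keys[:, :, keep, :]
        self.value_cache[layer_idx] = values[:, :, keep, :]
        self.accumulated_scores[layer_idx] = scores[keep]
        return self.key_cache[layer_idx], self.value_cache[layer_idx]
```

On the modeling side, `LlamaAttention.forward` currently calls `past_key_value.update(...)` before the attention weights are computed, so the weights would need to be computed first (or passed to the cache through a later hook), which is the reordering described above.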