Implement KV cache sparsity like H2O using attention scores
Feature request
Hello!
It is a bit like #26553, which implemented SinkCache. I would love to see a KV cache sparsity method like H2O implemented, as proposed in http://arxiv.org/abs/2306.14048.
The authors have released their code here: https://github.com/FMInference/H2O.
It could be used like this:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, H2O_Cache

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
inputs = tokenizer("Hello, my name is", return_tensors="pt")

cache = H2O_Cache(recent_length=512, HH_length=512)
gen_out = model.generate(**inputs, do_sample=False, max_new_tokens=3000, past_key_values=cache)
```
Motivation
As the H2O paper observes, a small portion of tokens (Heavy Hitters, or H2) contributes most of the value when computing attention scores. Based on this observation, H2O proposes a KV cache eviction policy that dynamically retains a balance of recent tokens and H2 tokens.
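To make the policy concrete, here is a minimal sketch of the selection step (not the authors' implementation): accumulate how much attention each cached position has received, always keep the most recent `recent_length` positions, and among the older positions keep the `HH_length` with the highest accumulated score. The function name is illustrative; the parameter names follow the proposed API above.

```python
import torch


def h2o_keep_indices(accumulated_scores: torch.Tensor, recent_length: int, HH_length: int) -> torch.Tensor:
    """Illustrative sketch: choose which KV cache positions to keep.

    accumulated_scores: shape (kv_len,), the total attention score each cached
    position has received so far (a proxy for how "heavy" a hitter it is).
    """
    device = accumulated_scores.device
    kv_len = accumulated_scores.shape[0]
    if kv_len <= recent_length + HH_length:
        return torch.arange(kv_len, device=device)  # cache still within budget, nothing to evict

    # Always keep the most recent `recent_length` positions.
    recent = torch.arange(kv_len - recent_length, kv_len, device=device)

    # Among the older positions, keep the `HH_length` heavy hitters.
    older_scores = accumulated_scores[: kv_len - recent_length]
    heavy_hitters = torch.topk(older_scores, k=HH_length).indices

    return torch.sort(torch.cat([heavy_hitters, recent])).values
```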
Your contribution
I would love to help implement this into transformers.
This would not only involve implementing an `H2O_Cache` in `src/transformers/cache_utils.py`, but also reordering some code in `LlamaAttention.forward` so that `Cache.update` can receive the attention scores, which other KV cache sparsity methods such as SnapKV (and future work) would also need.
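As a rough sketch of what that could look like, assuming the current `DynamicCache` API with per-layer `key_cache`/`value_cache` lists, and assuming the attention layer forwards its attention weights through `cache_kwargs` (all names here are illustrative; this reuses `h2o_keep_indices` from the sketch above):

```python
from typing import Any, Dict, Optional, Tuple

import torch
from transformers.cache_utils import DynamicCache


class H2O_Cache(DynamicCache):
    """Illustrative sketch of an H2O-style eviction cache (not an official API)."""

    def __init__(self, recent_length: int = 512, HH_length: int = 512):
        super().__init__()
        self.recent_length = recent_length
        self.HH_length = HH_length
        self.accumulated_scores: Dict[int, torch.Tensor] = {}

    def update(
        self,
        key_states: torch.Tensor,
        value_states: torch.Tensor,
        layer_idx: int,
        cache_kwargs: Optional[Dict[str, Any]] = None,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        keys, values = super().update(key_states, value_states, layer_idx, cache_kwargs)

        # Assumption: the attention layer forwards its attention weights
        # (batch, heads, q_len, kv_len) via `cache_kwargs["attention_scores"]`.
        attn_weights = (cache_kwargs or {}).get("attention_scores")
        if attn_weights is None:
            return keys, values

        # Accumulate the attention each cached position has received.
        scores = attn_weights.sum(dim=(0, 1, 2))
        prev = self.accumulated_scores.get(layer_idx)
        if prev is not None:
            scores = scores + torch.nn.functional.pad(prev, (0, scores.shape[0] - prev.shape[0]))

        # Evict down to `recent_length + HH_length` positions for this layer.
        keep = h2o_keep_indices(scores, self.recent_length, self.HH_length)
        self.key_cache[layer_idx] = keys[:, :, keep, :]
        self.value_cache[layer_idx] = values[:, :, keep, :]
        self.accumulated_scores[layer_idx] = scores[keep]
        return self.key_cache[layer_idx], self.value_cache[layer_idx]
```

On the modeling side, `LlamaAttention.forward` currently calls `past_key_value.update(...)` before the attention weights are computed, so the weights would need to be computed first (or passed to the cache through a later hook), which is the reordering described above.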