KV cache in the Transformer models
❓ Questions and Help
What is your question?
I'm a beginner, so please excuse my rather naive question: does fairseq's implementation of TransformerDecoder have a KV cache? Is incremental_state the KV cache?
What's your environment?
- fairseq Version (e.g., 1.0 or main): main
- PyTorch Version (e.g., 1.0): 2.1.0
- OS (e.g., Linux): Linux
- How you installed fairseq (pip, source): source
- Python version: 3.10.6
- GPU models and configuration: the MoE model [for language modeling] and the dense decoder-only model
It seems so: incremental_state is fairseq's KV cache. During incremental decoding, MultiheadAttention reads the keys and values computed on earlier steps out of incremental_state (stored under "prev_key" / "prev_value"), appends the projections for the current token, and writes the result back, so past positions are never re-projected. Please check this: https://github.com/facebookresearch/fairseq/blob/920a548ca770fb1a951f7f4289b4d3a0c1bc226f/fairseq/model_parallel/modules/multihead_attention.py#L128
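For intuition, here is a minimal, runnable sketch of that caching pattern. It is not fairseq's actual code: the function name attend_incrementally and the layer_id key are made up for illustration; only the "prev_key" / "prev_value" names mirror what fairseq's MultiheadAttention keeps in its saved input buffer.

```python
# Minimal sketch (assumed names, not fairseq's API) of KV caching via an
# incremental_state dict: each step appends the new token's key/value to
# the tensors cached on the previous step.
import torch

def attend_incrementally(q, k_new, v_new, incremental_state, layer_id="attn0"):
    """q, k_new, v_new: (batch, heads, 1, head_dim) for the current token."""
    saved = incremental_state.setdefault(layer_id, {})
    if "prev_key" in saved:
        # Reuse the cached keys/values; only the current step is new.
        k = torch.cat([saved["prev_key"], k_new], dim=2)
        v = torch.cat([saved["prev_value"], v_new], dim=2)
    else:
        k, v = k_new, v_new
    # Write the extended cache back for the next decoding step.
    saved["prev_key"], saved["prev_value"] = k, v

    attn = torch.softmax(q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5, dim=-1)
    return attn @ v

# Usage: decode token by token, threading the same incremental_state through.
incremental_state = {}
for step in range(3):
    q = torch.randn(1, 4, 1, 16)  # one new query per step
    k = torch.randn(1, 4, 1, 16)
    v = torch.randn(1, 4, 1, 16)
    out = attend_incrementally(q, k, v, incremental_state)
    print(step, incremental_state["attn0"]["prev_key"].shape)  # seq dim grows 1, 2, 3
```

In fairseq the same dict is threaded through the whole decoder forward pass, so every attention layer keeps its own cache entry keyed by module.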