[BUG] layer_past is ignored by DeepSpeedSelfAttention's compute_attention
**Describe the bug**
The past_key_values provided to the model are ignored by DeepSpeedSelfAttention.
**To Reproduce**
Call a DeepSpeed inference model with past_key_values and note that, at the following line, the data (now stored as the layer_past argument) is ignored: https://github.com/microsoft/DeepSpeed/blob/2f8d384e8bf3644e11e7bb2c658ddfcea7c611b1/deepspeed/ops/transformer/inference/ds_attention.py#L90
It looks like the BloomSelfAttention class does not ignore layer_past, but that is not the class used by default.
**Expected behavior**
I use past_key_values to enable acceleration of structured outputs for Transformers models: https://github.com/microsoft/guidance/blob/main/notebooks/guidance_acceleration.ipynb
I would like to be able to do the same for DeepSpeed models.
For standard Transformers models, passing past_key_values prepends those keys/values before the input_ids when computing attention, so only the new tokens need a forward pass (see the sketch below).
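For reference, this is roughly the behavior I rely on with stock transformers (a minimal sketch; the model name and strings are just illustrative): running only the suffix with the cached prefix gives the same next-token logits as re-running the whole sequence.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")            # illustrative model
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

prefix = tokenizer("The quick brown fox", return_tensors="pt").input_ids
suffix = tokenizer(" jumps over the lazy", return_tensors="pt").input_ids

with torch.no_grad():
    # 1) Run the prefix once and keep its KV cache.
    prefix_out = model(prefix, use_cache=True)
    # 2) Run only the suffix, reusing the cached keys/values of the prefix.
    cached_out = model(suffix, past_key_values=prefix_out.past_key_values, use_cache=True)
    # 3) Re-run the full sequence from scratch for comparison.
    full_out = model(torch.cat([prefix, suffix], dim=1), use_cache=True)

# The cached path matches the full recompute (up to numerical noise) in stock
# transformers; with DeepSpeed kernel injection the supplied cache is ignored.
print(torch.allclose(cached_out.logits[:, -1], full_out.logits[:, -1], atol=1e-4))
```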
**ds_report output**
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
async_io ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.0
[WARNING] using untested triton version (2.0.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/sclundbe/anaconda3/lib/python3.9/site-packages/torch']
torch version .................... 2.0.0+cu117
deepspeed install path ........... ['/home/sclundbe/anaconda3/lib/python3.9/site-packages/deepspeed']
deepspeed info ................... 0.9.1, unknown, unknown
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 11.3
deepspeed wheel compiled w. ...... torch 1.12, cuda 11.3
**System info (please complete the following information):**
- OS: Ubuntu 20.04
- GPU count and types: 4x A6000 (does not matter here)
- Python version: 3.9
Hi @slundberg,
The cache is controlled internally by DS-inference. As long as you pass the initial context (input prompt) to DS-inference, it will treat the previous context as the KV cache when generating new tokens. The layer_past argument is accepted by the transformer API only for compatibility with the transformers library (and several others); we use our own caching scheme, which has similar accuracy to the baseline.
Thanks, Reza
Thanks. Does DS-inference directly expose functionality similar to past_key_values? I am trying to interleave generation and prompts in a way that preserves the KV cache; this means I need to pass a starting KV cache to DS-inference, along with a prompt suffix that extends that cache (so that the first part of the prompt is not recomputed every time I call the model during a session).
Side note: I think an error should be thrown if layer_past is passed, because right now the model runs and silently ignores the cache context. (Also, GetMaxTokenLenght should probably be find/replaced to GetMaxTokenLength in the C++ code.)
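Something like the following would at least make the silent drop loud. This is purely a hypothetical stop-gap sketch: only the module path comes from the link above; the keyword-only check and the error message are my own, and it assumes layer_past reaches compute_attention as a keyword argument.

```python
# Hypothetical guard, not actual DeepSpeed code: make a user-supplied cache fail
# loudly instead of being silently dropped.
from deepspeed.ops.transformer.inference.ds_attention import DeepSpeedSelfAttention

_original_compute_attention = DeepSpeedSelfAttention.compute_attention

def _guarded_compute_attention(self, *args, **kwargs):
    # Assumption: layer_past is passed by keyword; positional passing is not covered here.
    if kwargs.get("layer_past") is not None:
        raise NotImplementedError(
            "DeepSpeed kernel injection manages its own KV cache and ignores "
            "layer_past / past_key_values; external caches are not supported."
        )
    return _original_compute_attention(self, *args, **kwargs)

DeepSpeedSelfAttention.compute_attention = _guarded_compute_attention
```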
Just to add a bit more detail: the blocker I ran into is that the C++ interface assumes the cache needs to be cleared whenever more than one token is passed as input: https://github.com/microsoft/DeepSpeed/blob/085981bf1caf5d7d0b26d05f7c7e9487e1b35190/csrc/transformer/inference/csrc/pt_binding.cpp#L451-L453
But I am trying to keep the cache (or a prefix of it) even when passing a batch of tokens.
Hi @slundberg and @RezaYazdaniAminabadi, I am encountering the same problem: I want to pass (e.g.) 5 input ids along with past_key_values for the 10 tokens that came before those 5, but DeepSpeed does not consider them (it does consider those 10 if I pass just 1 input id).
The above is necessary to realize significant inference speed gains from this Apr 2023 paper by Microsoft.
This paper describes an approach that guesses the next k>=1 tokens of a generation. If you guess correctly, you typically get k+1 generated tokens in one forward pass, improving the speed per token by a factor of k+1, but only if you can re-use the initial past_key_values. Otherwise you usually lose more than you win, unless you are a guessing genius, in which case you probably don't need the model at all.
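To make that concrete, here is roughly what the verification step looks like with stock transformers (a hedged sketch with an illustrative model and guess, greedy acceptance only). Step 2 is exactly the call DeepSpeed cannot serve today: more than one input token plus an externally supplied cache.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")            # illustrative model
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

prompt_ids = tokenizer("The capital of France is", return_tensors="pt").input_ids
guess_ids = tokenizer(" Paris, the", return_tensors="pt").input_ids   # k guessed tokens

with torch.no_grad():
    # 1) Encode the prompt once and keep its KV cache.
    prompt_out = model(prompt_ids, use_cache=True)
    # 2) Verify all k guesses in a single forward pass, reusing that cache.
    verify_out = model(guess_ids, past_key_values=prompt_out.past_key_values, use_cache=True)

# 3) Guess i is accepted (greedy decoding) if the model's argmax after seeing the
#    prompt plus guesses[:i] equals guesses[i]; one pass checks all k positions.
#    If all k are accepted, verify_out.logits[:, -1] additionally yields the
#    (k+1)-th token for free.
pred_after_prompt = prompt_out.logits[:, -1:].argmax(dim=-1)     # predicts guess 0
pred_after_guesses = verify_out.logits[:, :-1].argmax(dim=-1)    # predict guesses 1..k-1
predictions = torch.cat([pred_after_prompt, pred_after_guesses], dim=1)
accepted = int((predictions == guess_ids).long().cumprod(dim=1).sum())
print(f"accepted {accepted} of {guess_ids.shape[1]} guessed tokens in one forward pass")
```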
Fixing the current issue would probably also fix this non-deterministic behavior I reported, which occurs for context size 1.
@RezaYazdaniAminabadi Hi,
Is this likely to get fixed at some point? I think that managing the past KV cache externally is required for cases such as guidance (as referenced above), continuous batching, etc.
It would be great if passing the past KV cache were supported.