
Cache misses previous generation

Open · ultoris opened this issue 1 year ago

Expected Behavior

The server should cache both the previous prompt and the last generation.

Current Behavior

The cache misses at the end of the previous prompt, forcing the server to re-evaluate the previous answer in full.

Environment and Context

I'm interfacing with the llama-cpp-python server through a SillyTavern instance running in OpenAI-compatible mode. The system is Linux orangepi5 6.1.43-rockchip-rk3588 #1.1.8 SMP Fri Feb 2 21:16:10 CST 2024 aarch64 GNU/Linux with Python 3.11.7, GNU Make 4.3 and g++ (Debian 12.2.0-14) 12.2.0.

llama-cpp-python: commit 02812148635bf6337ffc7d1abb34093f4065df88
fastapi 0.110.2, numpy 1.26.4, sse-starlette 2.1.0, uvicorn 0.29.0
vendor/llama.cpp: commit 0e4802b2ecbaab04b4f829fde4a3096ca19c84b5 (Author: loonerin [email protected], Date: Fri Apr 19 13:03:35 2024 -0400)

Model metadata as reported by llama-cpp-python:

Model metadata: {'tokenizer.chat_template': "{% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n'+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{{ '<|start_header_id|>assistant<|end_header_id|>\n\n' }}", 'tokenizer.ggml.eos_token_id': '128009', 'general.quantization_version': '2', 'tokenizer.ggml.model': 'gpt2', 'general.architecture': 'llama', 'llama.rope.freq_base': '500000.000000', 'llama.context_length': '8192', 'general.name': 'llama-3-8b-Instruct', 'llama.vocab_size': '128256', 'general.file_type': '15', 'llama.embedding_length': '4096', 'llama.feed_forward_length': '14336', 'llama.attention.layer_norm_rms_epsilon': '0.000010', 'llama.rope.dimension_count': '128', 'tokenizer.ggml.bos_token_id': '128000', 'llama.attention.head_count': '32', 'llama.block_count': '32', 'llama.attention.head_count_kv': '8'}
Using gguf chat template: {% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>

'+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{{ '<|start_header_id|>assistant<|end_header_id|>

' }}
Using chat eos_token: <|eot_id|>
Using chat bos_token: <|begin_of_text|>
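
For reference, the gguf template above can be rendered outside the server with plain jinja2 to see the exact prompt text that gets tokenized. This is only a sketch (jinja2 and the hard-coded message list are my own additions, mirroring the first SillyTavern request from the reproduction steps below); the server's own chat formatter may differ in small details.

# Minimal sketch: render the gguf chat template shown above with jinja2.
from jinja2 import Template

chat_template = (
    "{% set loop_messages = messages %}"
    "{% for message in loop_messages %}"
    "{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n'"
    " + message['content'] | trim + '<|eot_id|>' %}"
    "{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}"
    "{{ content }}"
    "{% endfor %}"
    "{{ '<|start_header_id|>assistant<|end_header_id|>\n\n' }}"
)

messages = [
    {"role": "system", "content": "[Start a new Chat]"},
    {"role": "assistant", "content": "Hello"},
    {"role": "user", "content": "Write one word"},
]

prompt = Template(chat_template).render(messages=messages, bos_token="<|begin_of_text|>")
print(repr(prompt))
# The rendered prompt ends with '<|start_header_id|>assistant<|end_header_id|>\n\n',
# which is exactly the boundary where the cache mismatch shows up below.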

Failure Information (for bugs)

It seems that the token at the end of the previously cached prompt is re-tokenized as two different tokens in the next request.

Steps to Reproduce

  1. Run the server with: python3 -m llama_cpp.server --cache true --n_ctx 8192 --seed 0 --n_threads 4 --n_threads_batch 4 --model ../llama.cpp/models/llama-3-8b-Instruct.Q4_K_M.gguf --port 8080 --verbose true --cache_type ram --use_mlock true
  2. Send the message "Write one word". SillyTavern sends the following message (captured through tcpdump): {"messages":[{"role":"system","content":"[Start a new Chat]"},{"role":"assistant","content":"Hello"},{"role":"user","content":"Write one word"}],"model":"../llama.cpp/models/llama-3-8b-Instruct.Q4_K_M.gguf?download=true","temperature":1,"max_tokens":2048,"stream":true,"presence_penalty":0,"frequency_penalty":0,"top_p":1,"logit_bias":{},"seed":0} The last streamed chunks are:

data: {"id": "chatcmpl-47757f4e-8c62-42e5-8169-147e86c17762", "model": "../llama.cpp/models/llama-3-8b-Instruct.Q4_K_M.gguf?download=true", "created": 1713664603, "object": "chat.completion.chunk", "choices": [{"index": 0, "delta": {"content": ":\n\n"}, "logprobs": null, "finish_reason": null}]}

data: {"id": "chatcmpl-47757f4e-8c62-42e5-8169-147e86c17762", "model": "../llama.cpp/models/llama-3-8b-Instruct.Q4_K_M.gguf?download=true", "created": 1713664603, "object": "chat.completion.chunk", "choices": [{"index": 0, "delta": {"content": "Hello"}, "logprobs": null, "finish_reason": null}]}

data: {"id": "chatcmpl-47757f4e-8c62-42e5-8169-147e86c17762", "model": "../llama.cpp/models/llama-3-8b-Instruct.Q4_K_M.gguf?download=true", "created": 1713664603, "object": "chat.completion.chunk", "choices": [{"index": 0, "delta": {}, "logprobs": null, "finish_reason": "stop"}]}

  3. Send the message "Hi". SillyTavern sends the following message (captured through tcpdump): {"messages":[{"role":"system","content":"[Start a new Chat]"},{"role":"assistant","content":"Hello"},{"role":"user","content":"Write one word"},{"role":"assistant","content":"Hello! You said \"one word\", so I'll respond with:\n\nHello"},{"role":"user","content":"Hi"}],"model":"../llama.cpp/models/llama-3-8b-Instruct.Q4_K_M.gguf?download=true","temperature":1,"max_tokens":2048,"stream":true,"presence_penalty":0,"frequency_penalty":0,"top_p":1,"logit_bias":{},"seed":0}
  4. I've added the following debug prints to generate() in llama.py:
                if self.verbose:
                    # Summary line: matched prefix length / new prompt length / cached token count.
                    print("Llama.generate: prefix-match hit ({pre}/{prompt}/{total}).".format(
                        pre=longest_prefix, prompt=len(tokens), total=self.n_tokens))
                    # The full re-tokenized prompt, the part of the cache that matched,
                    # and the cached tail that no longer matches.
                    print("Llama.generate: prefix-prompt ", repr(self.detokenize(tokens)), file=sys.stderr)
                    print("Llama.generate: prefix-match: ", repr(self.detokenize(self._input_ids[:longest_prefix])), file=sys.stderr)
                    print("Llama.generate: prefix-miss: ", repr(self.detokenize(self._input_ids[longest_prefix:])), file=sys.stderr)

                    # Side-by-side dump: cached token ids vs. the new prompt's token ids.
                    for i, p in enumerate(zip(self._input_ids, tokens)):
                        print("{idx: <8}{a: <8}{b: <8}".format(idx=i, a=p[0], b=p[1]), file=sys.stderr)

These prints give the following output:

Llama._create_completion: cache saved
Llama.generate: prefix-match hit (31/61/47).
Llama.generate: prefix-prompt  b'ystem\n\n[Start a new Chat]assistant\n\nHellouser\n\nWrite one wordassistant\n\nHello! You said "one word", so I\'ll respond with:\n\nHellouser\n\nHiassistant\n\n'
Llama.generate: prefix-match:  b'ystem\n\n[Start a new Chat]assistant\n\nHellouser\n\nWrite one wordassistant'
Llama.generate: prefix-miss:  b'\n\nHello! You said "one word", so I\'ll respond with:\n\nHello'
0       128000  128000
1       128006  128006
2       9125    9125
3       128007  128007
4       271     271
5       58      58
6       3563    3563
7       264     264
8       502     502
9       13149   13149
10      60      60
11      128009  128009
12      128006  128006
13      78191   78191
14      128007  128007
15      198     198
16      198     198
17      9906    9906
18      128009  128009
19      128006  128006
20      882     882
21      128007  128007
22      198     198
23      198     198
24      8144    8144
25      832     832
26      3492    3492
27      128009  128009
28      128006  128006
29      78191   78191
30      128007  128007
31      271     198
32      9906    198
33      0       9906
34      1472    0
35      1071    1472
36      330     1071
37      606     330
38      3492    606
39      498     3492
40      779     498
41      358     779
42      3358    358
43      6013    3358
44      449     6013
45      1473    449
46      9906    512

As can be seen, token 31 in the cached sequence (271, the "\n\n" after the assistant header) corresponds to two tokens in the re-tokenized prompt (198 198, i.e. two separate "\n" tokens), so from that point on the cache no longer matches the prompt. However, I cannot see any difference in the messages or in the detokenized strings.
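
For completeness, that boundary can be poked at outside the server with a tokenizer-only check along these lines. This is a sketch under my own assumptions (the model path is the one from the reproduction above, vocab_only loads just the tokenizer, and whether it reproduces the exact 271 vs 198 198 split may depend on the tokenizer/llama.cpp version); it is not the server's actual code path:

# Minimal sketch: compare how the assistant-header boundary tokenizes when the
# prompt ends at "\n\n" (first request) versus when the previous answer is
# appended (second request), and how long the reusable prefix would be.
from llama_cpp import Llama

llm = Llama(
    model_path="../llama.cpp/models/llama-3-8b-Instruct.Q4_K_M.gguf",  # path from the repro above
    vocab_only=True,   # tokenizer only, no weights needed
    verbose=False,
)

header = b"<|start_header_id|>assistant<|end_header_id|>\n\n"
reply = b'Hello! You said "one word", so I\'ll respond with:\n\nHello'

# Roughly what the cache holds: the first prompt ends right after the header.
cached = llm.tokenize(header, add_bos=False, special=True)

# Roughly what the second request re-tokenizes: the same header, now followed
# by the previous answer.
resubmitted = llm.tokenize(header + reply, add_bos=False, special=True)

# Longest common prefix of the two token sequences, i.e. what a prefix cache
# could actually reuse.
n = 0
while n < min(len(cached), len(resubmitted)) and cached[n] == resubmitted[n]:
    n += 1

print("cached          :", cached)
print("resubmitted head:", resubmitted[:len(cached) + 2])
print("reusable prefix :", n, "of", len(cached), "cached tokens")

If the two sequences really do diverge at that trailing "\n\n", it would point at tokenization of the prompt text rather than at the cache lookup itself.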

ultoris · Apr 21 '24 02:04

Same here, and not just Llama 3; all other models have this issue after a recent update of llama.cpp.

Liquorice10113 · May 01 '24 07:05

I also notice that the second prompt is processed more slowly than in e.g. gpt4all with an otherwise identical setup.

woheller69 · May 12 '24 07:05

~I can confirm this. I suspect that the issue started with version 0.2.77.~

~I've booted the server and run this curl multiple times:~

curl --location 'http://127.0.0.1:8080/v1/completions' \
--header 'Content-Type: application/json' \
--data '{
  "prompt": "<BIG PROMPT BUT PROPERLY FORMATTED>",
  "max_tokens": 2048,
  "temperature": 0,
  "stream": false
}'

(And I don't know if this is normal, but I noticed that the very first call produces different output from the subsequent ones. With temperature=0 I was expecting them all to be identical.)

[EDIT] Ignore my comment. I was looking at raw times instead of token/second.

Vaskivo · Jun 21 '24 23:06

Maybe the \n(s) are stripped off. See https://github.com/Maximilian-Winter/llama-cpp-agent/pull/73

woheller69 · Jul 10 '24 15:07

+, I have the same issue for version 0.2.82

futurisold · Jul 12 '24 17:07

> +, I have the same issue for version 0.2.82

Check your prompt template

woheller69 · Jul 12 '24 18:07

> +, I have the same issue for version 0.2.82
>
> Check your prompt template

I suppose the llama-cpp-python server should mirror the llama.cpp server, which works fine with the prompt I'm currently using, so it's clearly not the prompt.

futurisold · Jul 12 '24 18:07