llama-server bug: Prompt caching fails when editing the second user input
Name and Version
I'm using the latest llama-server build.
Operating systems
Linux
Which llama.cpp modules do you know to be affected?
llama-server
Command line
llama.cpp/build/bin/llama-server --model models/DeepSeek-V3-0324-UD-IQ3_XXS-00001-of-00006.gguf -ngl 4 -c 16000 -ctk q8_0
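If the behaviour needs to be reproduced outside a web UI, the server's /completion endpoint exposes a cache_prompt field that enables prefix reuse between requests. A request body along these lines (the prompt text is illustrative) should only re-evaluate the part of the prompt that changed since the previous request:

```json
{
  "prompt": "...long shared context... second user input (edited)",
  "cache_prompt": true,
  "n_predict": 128
}
```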
Problem description & steps to reproduce
- If I enter a long context as part of a user query, this works after a long processing time (I'm using CPU offloading). ✔️
- If I then edit the end of that long input and submit again, the model response updates pretty quickly as the prompt is cached. ✔️
- If I then add a second user input after the first response, this also starts outputting a second response quickly, as expected. ✔️
- But, if I then edit the second user input and resubmit it, the entire context is processed from the start again, taking a very long time. Prompt caching seems to have failed. ❌
I'm not attempting any prompt caching across runs; this is all within a single session.
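For clarity, the expected behaviour is that only the suffix after the first edited token needs re-evaluation, not the whole context. This is a hypothetical sketch (not llama.cpp code) of that prefix-matching logic, to pin down what "caching failed" means in step 4:

```python
# Hypothetical sketch of prompt-prefix caching: given the token sequence
# already in the KV cache and the new prompt's tokens, only the part after
# the first divergence should need re-evaluation.
def tokens_to_reprocess(cached: list[str], new: list[str]) -> int:
    """Return how many tokens of `new` must be re-evaluated."""
    common = 0
    for a, b in zip(cached, new):
        if a != b:
            break
        common += 1
    return len(new) - common

# Editing only the tail of the second input should invalidate only that tail:
cached = ["long", "context", "second", "input", "v1"]
edited = ["long", "context", "second", "input", "v2"]
print(tokens_to_reprocess(cached, edited))  # 1 -- just the edited token
```

In the failing case, the server behaves as if the common prefix were zero and re-evaluates everything from the start.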
Edit: there's a possibility I've overrun the context size of 16000; I'd best check that before calling this a bug.
Relevant log output