llama-server bug: Prompt caching fails when editing the second user input
Name and Version
I'm using the latest llama-server build.
Operating systems
Linux
Which llama.cpp modules do you know to be affected?
llama-server
Command line
llama.cpp/build/bin/llama-server --model models/DeepSeek-V3-0324-UD-IQ3_XXS-00001-of-00006.gguf -ngl 4 -c 16000 -ctk q8_0
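If the behaviour needs to be reproduced outside a web UI, the server's /completion endpoint exposes a cache_prompt field that enables prefix reuse between requests. A request body along these lines (the prompt text is illustrative) should only re-evaluate the part of the prompt that changed since the previous request:

```json
{
  "prompt": "...long shared context... second user input (edited)",
  "cache_prompt": true,
  "n_predict": 128
}
```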
Problem description & steps to reproduce
- If I enter a long context as part of a user query, this works after a long processing time (I'm using CPU offloading). ✔️
- If I then edit the end of that long input and submit again, the model response updates pretty quickly as the prompt is cached. ✔️
- If I then add a second user input after the first response, this also starts outputting a second response quickly, as expected. ✔️
- But, if I then edit the second user input and resubmit it, the entire context is processed from the start again, taking a very long time. Prompt caching seems to have failed. ❌
I'm not attempting any prompt caching across runs; this is all within a single session.
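For clarity, the expected behaviour is that only the suffix after the first edited token needs re-evaluation, not the whole context. This is a hypothetical sketch (not llama.cpp code) of that prefix-matching logic, to pin down what "caching failed" means in step 4:

```python
# Hypothetical sketch of prompt-prefix caching: given the token sequence
# already in the KV cache and the new prompt's tokens, only the part after
# the first divergence should need re-evaluation.
def tokens_to_reprocess(cached: list[str], new: list[str]) -> int:
    """Return how many tokens of `new` must be re-evaluated."""
    common = 0
    for a, b in zip(cached, new):
        if a != b:
            break
        common += 1
    return len(new) - common

# Editing only the tail of the second input should invalidate only that tail:
cached = ["long", "context", "second", "input", "v1"]
edited = ["long", "context", "second", "input", "v2"]
print(tokens_to_reprocess(cached, edited))  # 1 -- just the edited token
```

In the failing case, the server behaves as if the common prefix were zero and re-evaluates everything from the start.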
Edit: there's a possibility I've overrun the context size of 16000; I'd best check that before calling this a bug.
Relevant log output