
llama-server bug: Prompt caching fails when editing the second user input

Open · araleza opened this issue 9 months ago · 0 comments

Name and Version

I'm using the latest build of llama-server.

Operating systems

Linux

Which llama.cpp modules do you know to be affected?

llama-server

Command line

llama.cpp/build/bin/llama-server --model models/DeepSeek-V3-0324-UD-IQ3_XXS-00001-of-00006.gguf -ngl 4 -c 16000 -ctk q8_0

Problem description & steps to reproduce

  • If I enter a long context as part of a user query, this works after a long processing time (I'm using CPU offloading). ✔️
  • If I then edit the end of that long input and submit again, the model response updates pretty quickly as the prompt is cached. ✔️
  • If I then add a second user input after the first response, this also starts outputting a second response quickly, as expected. ✔️
  • But if I then edit the second user input and resubmit it, the entire context is reprocessed from the start, which takes a very long time. Prompt caching appears to have failed. ❌
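For context on why only a tail edit should be cheap: llama-server reuses the KV cache for the longest token prefix shared between the previous prompt and the new one, so the cost of a resubmit should be roughly proportional to how early the first changed token appears. The sketch below is an illustration of that prefix-matching idea, not llama.cpp's actual implementation; the token values are made up.

```python
def common_prefix_len(cached_tokens, new_tokens):
    """Return how many leading tokens match between the cached
    prompt and the new prompt; only tokens past this point need
    to be (re)processed."""
    n = 0
    for a, b in zip(cached_tokens, new_tokens):
        if a != b:
            break
        n += 1
    return n

cached = [1, 2, 3, 4, 5, 6, 7, 8]

# Editing only the tail of the prompt keeps most of the cache:
assert common_prefix_len(cached, [1, 2, 3, 4, 5, 9, 9]) == 5

# But if anything early in the sequence differs (e.g. the server
# rebuilt or shifted the context), almost nothing is reusable and
# the whole prompt is reprocessed:
assert common_prefix_len(cached, [1, 9, 3, 4, 5, 6, 7, 8]) == 1
```

If the bug is real, something in the second-edit path is apparently producing an early divergence (or discarding the cache outright) even though only the tail of the prompt changed.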

I'm not attempting any prompt caching across runs, this is all within a single session.

Edit: it's possible I've overrun the context size of 16000; I should check that before calling this a bug.

Relevant log output


araleza · Apr 26 '25 15:04