llama3.java

Added reading and writing of a state-cache

Open srogmann opened this issue 1 year ago • 2 comments

This PR contains an example implementation for storing the computed states of a system prompt on disk.

I wrote this implementation a few weeks ago, so it may not be mergeable directly.

srogmann, Oct 05 '24 22:10

Hey @srogmann, this is awesome! I rebased and played with it; there are some rough edges.

  • The cached prompt must match the given prompt exactly. To improve usability, it should start from the longest matching prefix.
  • Needs some usability polish: there are no docs, so it's not clear how to cache a prompt and how to use the cache.
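The longest-prefix idea from the first point could be sketched roughly as follows. This is only an illustration, not the code in this PR: the class `PrefixMatch` and the method `commonPrefixLength` are hypothetical names, and it assumes both the cached prompt and the new prompt are available as token-id arrays.

```java
import java.util.Arrays;

// Hypothetical helper, not part of llama3.java: finds how many leading
// tokens of a cached prompt can be reused for a new prompt.
final class PrefixMatch {

    // Returns the number of leading tokens shared by both arrays.
    static int commonPrefixLength(int[] cachedTokens, int[] promptTokens) {
        // Arrays.mismatch returns -1 when both arrays are equal, otherwise
        // the index of the first difference (or the shorter length when
        // one array is a proper prefix of the other).
        int mismatch = Arrays.mismatch(cachedTokens, promptTokens);
        return mismatch < 0 ? cachedTokens.length : mismatch;
    }

    public static void main(String[] args) {
        int[] cached = {1, 2, 3};
        int[] prompt = {1, 2, 4, 5};
        // Only the first two positions match, so inference would resume
        // at position 2 instead of recomputing from position 0.
        System.out.println(commonPrefixLength(cached, prompt)); // prints 2
    }
}
```

With such a helper, the cached state for the shared positions could be reused and only the remaining prompt tokens would need to be evaluated.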

This feature is a must-have, and I'd really like caching to be composable, so it can be turned on easily and transparently, e.g. for a completions API. I've discussed with a colleague how to do caching on disk and we have some ideas. Note that KV caches can also be quantized to save memory and disk space.
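To make the quantization remark concrete, here is a minimal sketch of one common scheme: symmetric 8-bit quantization with one scale per block of values. It is an assumption for illustration only; the thread does not say which scheme is intended, and `KvQuantSketch`, `Quantized`, and the block size of 32 are hypothetical choices.

```java
// Hypothetical sketch, not part of llama3.java: stores floats as signed
// bytes plus one float scale per block of 32 values.
final class KvQuantSketch {
    static final int BLOCK = 32; // assumed block size

    record Quantized(byte[] values, float[] scales) {}

    static Quantized quantize(float[] x) {
        int blocks = (x.length + BLOCK - 1) / BLOCK;
        byte[] q = new byte[x.length];
        float[] scales = new float[blocks];
        for (int b = 0; b < blocks; b++) {
            int start = b * BLOCK;
            int end = Math.min(start + BLOCK, x.length);
            // The largest magnitude in the block maps to +/-127.
            float maxAbs = 0f;
            for (int i = start; i < end; i++) {
                maxAbs = Math.max(maxAbs, Math.abs(x[i]));
            }
            float scale = maxAbs / 127f;
            scales[b] = scale;
            for (int i = start; i < end; i++) {
                // An all-zero block has scale 0; store zeros to avoid 0/0.
                q[i] = scale == 0f ? 0 : (byte) Math.round(x[i] / scale);
            }
        }
        return new Quantized(q, scales);
    }

    static float[] dequantize(Quantized z) {
        float[] y = new float[z.values().length];
        for (int i = 0; i < y.length; i++) {
            y[i] = z.values()[i] * z.scales()[i / BLOCK];
        }
        return y;
    }
}
```

In this sketch each value takes one byte plus a shared four-byte scale per block, roughly a quarter of the size of 32-bit floats, at the cost of a rounding error of at most about half a scale step per value.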

I'm busy this week, but this is a good start!

mukel, Oct 07 '24 06:10

Hi @mukel, the first version of the state-cache was deliberately kept brief. I have now added support for the longest matching prefix and some documentation. I also added the name of the gguf-file to avoid confusion.
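One way a stored model name can guard against mixing up caches is sketched below. This is not the file layout used in the PR, which is not described in this thread; the class `StateCacheHeaderSketch`, the magic number, and the field order are all hypothetical.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical sketch: a cache header that records which model file and
// which prompt tokens the cached state belongs to.
final class StateCacheHeaderSketch {
    static final int MAGIC = 0x53434831; // arbitrary marker, assumed

    static void writeHeader(OutputStream out, String modelFileName, int[] tokens)
            throws IOException {
        DataOutputStream data = new DataOutputStream(out);
        data.writeInt(MAGIC);
        data.writeUTF(modelFileName); // e.g. the name of the gguf-file
        data.writeInt(tokens.length);
        for (int token : tokens) {
            data.writeInt(token);
        }
        data.flush();
    }

    // Returns the cached prompt tokens, or throws if the cache was
    // written for a different model file.
    static int[] readHeader(InputStream in, String expectedModelFileName)
            throws IOException {
        DataInputStream data = new DataInputStream(in);
        if (data.readInt() != MAGIC) {
            throw new IOException("Not a state-cache file");
        }
        String modelFileName = data.readUTF();
        if (!modelFileName.equals(expectedModelFileName)) {
            throw new IOException("State-cache was written for " + modelFileName);
        }
        int count = data.readInt();
        if (count < 0) {
            throw new IOException("Corrupt token count: " + count);
        }
        int[] tokens = new int[count];
        for (int i = 0; i < count; i++) {
            tokens[i] = data.readInt();
        }
        return tokens;
    }
}
```

The tokens returned by `readHeader` are what a longest-prefix comparison against the new prompt would operate on; the cached state itself would follow the header in the same stream.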

srogmann, Oct 15 '24 21:10