
Remove subsequences of cached tokens to match a longer prefix

Open · m42a opened this issue · 0 comments

This is useful if your text is too large to fit into the cache all at once. Previously, removing tokens from the beginning or middle of the cache forced every token after the removed span to be reevaluated. This change instead shifts the already-evaluated tokens so they stay in the cache, leaving less work to do before generation of new tokens can start.
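As a rough illustration of the idea (not the code in this PR), the sketch below greedily matches the cached token list against a new prompt and reports which spans of cached tokens would have to be dropped so that the remaining cached tokens line up with a longer prefix of the prompt. The function name, return format, and greedy matching strategy are assumptions made for this example; in the actual binding those spans would translate into KV-cache removals plus a shift of the surviving tokens.

```python
from typing import List, Tuple


def spans_to_remove(cached: List[int], prompt: List[int]) -> Tuple[List[Tuple[int, int]], int]:
    """Hypothetical helper: greedily match cached tokens against a new prompt.

    Returns (spans, matched), where `spans` is a list of [start, end) index
    ranges of cached tokens that would need to be removed, and `matched` is
    how many prompt tokens the surviving cached tokens cover.
    """
    spans = []
    ci = 0  # index into the cached tokens
    pi = 0  # index into the new prompt tokens
    while ci < len(cached) and pi < len(prompt):
        if cached[ci] == prompt[pi]:
            # Cached token still matches the prompt; keep it.
            ci += 1
            pi += 1
        else:
            # Skip cached tokens until one matches the current prompt token;
            # the skipped range is a span that would be removed from the cache.
            start = ci
            while ci < len(cached) and cached[ci] != prompt[pi]:
                ci += 1
            spans.append((start, ci))
    return spans, pi


# Example: the cache holds an old prompt; the new prompt shares its
# beginning and end but differs in the middle.
cached = [1, 2, 3, 4, 5, 6]
prompt = [1, 2, 5, 6, 7, 8]
print(spans_to_remove(cached, prompt))  # ([(2, 4)], 4)
```

In this example only the two tokens in the middle span would be dropped; the four surviving cached tokens cover the first four prompt tokens, so evaluation can resume from position 4 instead of 0.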

m42a · Jan 19 '24, 06:01