llama-cpp-python
Remove subsequences of cached tokens to match a longer prefix
This is useful when the text is too large to fit into the cache all at once. Previously, removing tokens from the beginning or middle of the prompt forced every token after that point to be re-evaluated. Now the already-evaluated tokens are shifted so they stay in the cache, leaving less work to do before new tokens can be generated.
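The sketch below illustrates the idea, not the library's actual implementation: find the longest common prefix between the cached tokens and the new prompt, then look for a cached subsequence that can be dropped so the cached tokens after it line up with more of the prompt, and shift those tokens in place instead of re-evaluating them. The `kv_cache_seq_rm` / `kv_cache_seq_shift` functions are stand-ins for llama.cpp-style KV-cache sequence operations, not the exact bindings exposed by llama-cpp-python.

```python
from typing import List


def kv_cache_seq_rm(start: int, end: int) -> None:
    """Stand-in: drop cached positions [start, end)."""
    print(f"remove cached positions [{start}, {end})")


def kv_cache_seq_shift(start: int, end: int, delta: int) -> None:
    """Stand-in: shift cached positions [start, end) by delta."""
    print(f"shift cached positions [{start}, {end}) by {delta}")


def match_longer_prefix(cached: List[int], prompt: List[int]) -> int:
    """Return how many prompt tokens are already covered by the cache,
    removing and shifting a cached subsequence where that helps."""
    # Plain longest common prefix: these tokens are reusable as-is.
    n = 0
    while n < min(len(cached), len(prompt)) and cached[n] == prompt[n]:
        n += 1

    # Look for a cached subsequence to drop so that the tokens after it
    # line up with more of the prompt (e.g. a chunk was deleted from the
    # middle of the text).
    for skip_end in range(n + 1, len(cached)):
        tail = cached[skip_end:]
        if prompt[n:n + len(tail)] == tail:
            # Drop cached positions [n, skip_end) ...
            kv_cache_seq_rm(n, skip_end)
            # ... and slide the surviving tail down so its positions match
            # the new prompt instead of re-evaluating it.
            kv_cache_seq_shift(skip_end, len(cached), n - skip_end)
            return n + len(tail)

    return n


if __name__ == "__main__":
    cached = [1, 2, 3, 4, 5, 6, 7, 8]
    prompt = [1, 2, 3, 7, 8, 9]  # tokens 4-6 were removed from the text
    reused = match_longer_prefix(cached, prompt)
    print(f"{reused} of {len(prompt)} prompt tokens reused; "
          f"{len(prompt) - reused} still need evaluation")
```

In this example only the final new token needs to be evaluated, because the cached tokens after the removed span are shifted down rather than recomputed.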