Regression in unified KV cache appears after `llama.cpp` release b5912, starting with b5913
This issue concerns the llama-cpp-python community but was filed on the llama.cpp tracker first: https://github.com/ggml-org/llama.cpp/issues/14847.
I just wanted to bring it to your attention, and I can relocate the issue if it is more relevant here. For convenience, the issue description is reproduced below:
Running llama-cpp-python against llama.cpp built after b5912 (starting with b5913) results in:
```
llama.cpp/src/llama-kv-cache-unified.cpp:222: GGML_ASSERT(seq_id >= 0 && (size_t) seq_id < seq_to_stream.size()) failed
```
This appears to be a regression in sequence ID handling, or in the unified KV cache logic, that affects external bindings (a sketch of the failing check follows the note below). It is consistent with the heavy KV-cache work in b5913 preparing the K/V buffers for separation.
NOTE: llama-cli runs successfully, but llama-cpp-python fails against the same llama.cpp build with the same model.
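For context, the failing check is essentially a bounds test: every sequence id handed to the unified cache must index into its `seq_to_stream` mapping. Below is a minimal, self-contained sketch of that kind of check, assuming illustrative names and structure; it is not the actual llama.cpp source:

```cpp
// Illustrative sketch of the kind of bounds check behind the failing assert.
// Not the real llama.cpp code: the struct and names here are stand-ins.
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <vector>

using llama_seq_id = int32_t;

struct kv_cache_unified_sketch {
    // One entry per sequence id, mapping each sequence to the stream that stores it.
    std::vector<uint32_t> seq_to_stream;

    uint32_t stream_for(llama_seq_id seq_id) const {
        // Mirrors the assertion reported above: a negative or out-of-range
        // sequence id aborts the process.
        assert(seq_id >= 0 && (size_t) seq_id < seq_to_stream.size());
        return seq_to_stream[(size_t) seq_id];
    }
};

int main() {
    kv_cache_unified_sketch cache;
    cache.seq_to_stream = {0}; // a single stream, serving sequence id 0

    std::printf("seq 0 -> stream %u\n", (unsigned) cache.stream_for(0));

    // A caller that passes -1 (often used as an "all sequences" wildcard)
    // or an id the cache was never sized for would trip the assert here:
    // cache.stream_for(-1);
    return 0;
}
```

Under this shape, llama-cli presumably only ever passes ids the cache was sized for, while a binding that manipulates the cache with a wildcard or stale id would hit the assert.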
You can try my fork; this commit fixes the problem: https://github.com/JamePeng/llama-cpp-python/commit/8d981f0455b3adabf46417e3f9304ca6b70357ed
@akarasulu I believe this was fixed by @iamlemec and should be live in the recent 0.3.15 release
The assert has since been updated in llama.cpp (a `seq_id == -1` case was added to the condition). You can try updating the llama.cpp version vendored under `vendor/llama.cpp`.
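For readers following along, the relaxed condition presumably has roughly the following shape; this is only a sketch of the described change (treating `-1` as an "all sequences" wildcard), not the exact updated llama.cpp code:

```cpp
// Sketch of a bounds check relaxed to tolerate seq_id == -1.
// Illustrative only; the exact condition in the updated source may differ.
#include <cassert>
#include <cstddef>
#include <cstdint>

using llama_seq_id = int32_t;

inline void check_seq_id(llama_seq_id seq_id, std::size_t n_seqs) {
    // -1 is accepted as a wildcard meaning "all sequences";
    // any other value must be a valid index into the seq-to-stream mapping.
    assert(seq_id == -1 || (seq_id >= 0 && (std::size_t) seq_id < n_seqs));
}

int main() {
    check_seq_id(0, 1);   // valid: in range
    check_seq_id(-1, 1);  // valid: wildcard, no longer aborts
    return 0;
}
```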