Regression in unified KV cache appears after `llama.cpp` release b5912, starting with b5913
This issue concerns the llama-cpp-python community but was filed on the llama.cpp tracker first: https://github.com/ggml-org/llama.cpp/issues/14847.
I just wanted to bring it to your attention, and I can relocate the issue if it is more relevant here. For convenience, the issue description is reproduced below:
Running llama-cpp-python against llama.cpp built after b5912 (starting with b5913) results in:
```
llama.cpp/src/llama-kv-cache-unified.cpp:222: GGML_ASSERT(seq_id >= 0 && (size_t) seq_id < seq_to_stream.size()) failed
```
This appears to be a regression in sequence ID handling, or in the unified KV cache logic, that affects external bindings (a sketch of the failing check follows the note below). It is consistent with the heavy KV-cache work in b5913 preparing the K/V buffers for separation.
NOTE: llama-cli runs successfully, but llama-cpp-python fails against the same llama.cpp build with the same model.
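For context, the failing check is essentially a bounds test: every sequence id handed to the unified cache must index into its `seq_to_stream` mapping. Below is a minimal, self-contained sketch of that kind of check, assuming illustrative names and structure; it is not the actual llama.cpp source:

```cpp
// Illustrative sketch of the kind of bounds check behind the failing assert.
// Not the real llama.cpp code: the struct and names here are stand-ins.
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <vector>

using llama_seq_id = int32_t;

struct kv_cache_unified_sketch {
    // One entry per sequence id, mapping each sequence to the stream that stores it.
    std::vector<uint32_t> seq_to_stream;

    uint32_t stream_for(llama_seq_id seq_id) const {
        // Mirrors the assertion reported above: a negative or out-of-range
        // sequence id aborts the process.
        assert(seq_id >= 0 && (size_t) seq_id < seq_to_stream.size());
        return seq_to_stream[(size_t) seq_id];
    }
};

int main() {
    kv_cache_unified_sketch cache;
    cache.seq_to_stream = {0}; // a single stream, serving sequence id 0

    std::printf("seq 0 -> stream %u\n", (unsigned) cache.stream_for(0));

    // A caller that passes -1 (often used as an "all sequences" wildcard)
    // or an id the cache was never sized for would trip the assert here:
    // cache.stream_for(-1);
    return 0;
}
```

Under this shape, llama-cli presumably only ever passes ids the cache was sized for, while a binding that manipulates the cache with a wildcard or stale id would hit the assert.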
You can try my fork; this commit fixes the problem: https://github.com/JamePeng/llama-cpp-python/commit/8d981f0455b3adabf46417e3f9304ca6b70357ed
@akarasulu I believe this was fixed by @iamlemec and should be live in the recent 0.3.15 release
The assert has since been updated in llama.cpp (a `seq_id == -1` case was added to the condition). You can try updating the llama.cpp version vendored under `vendor/llama.cpp`.
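For readers following along, the relaxed condition presumably has roughly the following shape; this is only a sketch of the described change (treating `-1` as an "all sequences" wildcard), not the exact updated llama.cpp code:

```cpp
// Sketch of a bounds check relaxed to tolerate seq_id == -1.
// Illustrative only; the exact condition in the updated source may differ.
#include <cassert>
#include <cstddef>
#include <cstdint>

using llama_seq_id = int32_t;

inline void check_seq_id(llama_seq_id seq_id, std::size_t n_seqs) {
    // -1 is accepted as a wildcard meaning "all sequences";
    // any other value must be a valid index into the seq-to-stream mapping.
    assert(seq_id == -1 || (seq_id >= 0 && (std::size_t) seq_id < n_seqs));
}

int main() {
    check_seq_id(0, 1);   // valid: in range
    check_seq_id(-1, 1);  // valid: wildcard, no longer aborts
    return 0;
}
```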