Feature request: add support for streaming tool use
Setting stream=True together with tool_choice="auto" currently raises an exception, which leaves developers stuck with one of two unfortunate choices (see the repro sketch below):
- Building an application that streams the response but cannot use tools
- Building an application that can use tools but cannot stream the response
Relevant discussion: https://github.com/abetlen/llama-cpp-python/discussions/1615
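For concreteness, here is a minimal repro sketch, assuming a llama-cpp-python server running locally behind an OpenAI-compatible endpoint; the base URL, model name, and example tool are placeholder assumptions, not from this issue:

```python
# Minimal repro sketch: an OpenAI client pointed at a local
# llama-cpp-python server. Endpoint, model name, and the example
# tool below are all placeholder assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="sk-no-key-needed")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical example tool
        "description": "Return the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

# The failing combination: streaming plus automatic tool choice.
stream = client.chat.completions.create(
    model="local-model",
    messages=[{"role": "user", "content": "What's the weather in Berlin?"}],
    tools=tools,
    tool_choice="auto",
    stream=True,  # currently errors instead of streaming tool-call deltas
)
for chunk in stream:
    print(chunk.choices[0].delta)
```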
Admittedly this is the wrong place to ask, but as a beginner I feel like you're the right person to answer:
Does something need to change in llama.cpp itself to handle streaming tool calling? I see from your feature branch that you added a RAG layer to this Python implementation. I ask because I built llama.cpp from source, figuring it would be better optimized for my system, but I am stuck on this server error:
{"code":500,"message":"Cannot use tools with stream","type":"server_error"}.
Would this error go away if I installed the pre-built Python version instead?
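For what it's worth, this is roughly how I hit the error against the from-source llama.cpp server; the port, model name, and tool definition are assumptions on my part, and only the error body is the one quoted above:

```python
# Hedged sketch: POSTing a streaming chat completion with tools to a
# llama.cpp server built from source. Port, model name, and the tool
# definition are assumptions; the expected error body is quoted above.
import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",  # assumed local server endpoint
    json={
        "model": "local-model",
        "messages": [{"role": "user", "content": "What's the weather in Berlin?"}],
        "tools": [{
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Return the current weather for a city",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }],
        "tool_choice": "auto",
        "stream": True,
    },
)
print(resp.status_code, resp.text)
# Expected: HTTP 500 with the "Cannot use tools with stream" error body quoted above.
```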
Edit: I see here that there's a PR in draft. We're too close to the bleeding edge!
Llama.cpp is still waiting on https://github.com/ggml-org/llama.cpp/pull/12379
I'm not sure how this Python library handles tool calls; I believe its approach differs somewhat from llama.cpp's.
https://github.com/ggml-org/llama.cpp/pull/12379 has been merged!
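For reference, once streaming tool calls work end to end, the usual client-side pattern is to accumulate the OpenAI-style deltas back into complete calls. A sketch under the same placeholder assumptions as above (endpoint, model, and tool names are not from this issue):

```python
# Sketch of consuming streamed tool calls: in the OpenAI streaming format,
# a tool call arrives as fragments (the name once, the JSON arguments in
# pieces) that the client stitches back together by tool-call index.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="sk-no-key-needed")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical example tool
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

stream = client.chat.completions.create(
    model="local-model",
    messages=[{"role": "user", "content": "What's the weather in Berlin?"}],
    tools=tools,
    tool_choice="auto",
    stream=True,
)

calls = {}  # tool-call index -> accumulated name and argument string
for chunk in stream:
    delta = chunk.choices[0].delta
    for tc in delta.tool_calls or []:
        slot = calls.setdefault(tc.index, {"name": "", "arguments": ""})
        if tc.function and tc.function.name:
            slot["name"] = tc.function.name
        if tc.function and tc.function.arguments:
            slot["arguments"] += tc.function.arguments  # JSON arrives in pieces

print(calls)  # e.g. {0: {"name": "get_weather", "arguments": '{"city": "Berlin"}'}}
```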