Feature request: add support for streaming tool use
Setting stream=True together with tool_choice="auto" currently raises an exception, which leaves developers stuck with one of two unfortunate choices (see the repro sketch below):
- Building an application that streams the response but cannot use tools
- Building an application that can use tools but cannot stream the response
Relevant discussion: https://github.com/abetlen/llama-cpp-python/discussions/1615
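For concreteness, here is a minimal repro sketch, assuming a llama-cpp-python server running locally behind an OpenAI-compatible endpoint; the base URL, model name, and example tool are placeholder assumptions, not from this issue:

```python
# Minimal repro sketch: an OpenAI client pointed at a local
# llama-cpp-python server. Endpoint, model name, and the example
# tool below are all placeholder assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="sk-no-key-needed")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical example tool
        "description": "Return the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

# The failing combination: streaming plus automatic tool choice.
stream = client.chat.completions.create(
    model="local-model",
    messages=[{"role": "user", "content": "What's the weather in Berlin?"}],
    tools=tools,
    tool_choice="auto",
    stream=True,  # currently errors instead of streaming tool-call deltas
)
for chunk in stream:
    print(chunk.choices[0].delta)
```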
Admittedly this is the wrong place to ask, but as a beginner I feel like you're the right person to answer:
Does something need to change in llama.cpp itself to handle streaming tool calling? I see from your feature branch that you added a RAG layer to this Python implementation. I ask because I built llama.cpp from source, figuring it would be better optimized for my system, but I am stuck on this server error:
{"code":500,"message":"Cannot use tools with stream","type":"server_error"}.
Would this error go away if I installed the pre-built Python version instead?
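For what it's worth, this is roughly how I hit the error against the from-source llama.cpp server; the port, model name, and tool definition are assumptions on my part, and only the error body is the one quoted above:

```python
# Hedged sketch: POSTing a streaming chat completion with tools to a
# llama.cpp server built from source. Port, model name, and the tool
# definition are assumptions; the expected error body is quoted above.
import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",  # assumed local server endpoint
    json={
        "model": "local-model",
        "messages": [{"role": "user", "content": "What's the weather in Berlin?"}],
        "tools": [{
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Return the current weather for a city",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }],
        "tool_choice": "auto",
        "stream": True,
    },
)
print(resp.status_code, resp.text)
# Expected: HTTP 500 with the "Cannot use tools with stream" error body quoted above.
```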
Edit: I see here that there's a PR in draft. We're too close to the bleeding edge!
Llama.cpp is still waiting on https://github.com/ggml-org/llama.cpp/pull/12379
I'm not sure how this Python library handles tool calls; I believe its approach differs somewhat from llama.cpp's.
https://github.com/ggml-org/llama.cpp/pull/12379 has been merged!
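For reference, once streaming tool calls work end to end, the usual client-side pattern is to accumulate the OpenAI-style deltas back into complete calls. A sketch under the same placeholder assumptions as above (endpoint, model, and tool names are not from this issue):

```python
# Sketch of consuming streamed tool calls: in the OpenAI streaming format,
# a tool call arrives as fragments (the name once, the JSON arguments in
# pieces) that the client stitches back together by tool-call index.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="sk-no-key-needed")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical example tool
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

stream = client.chat.completions.create(
    model="local-model",
    messages=[{"role": "user", "content": "What's the weather in Berlin?"}],
    tools=tools,
    tool_choice="auto",
    stream=True,
)

calls = {}  # tool-call index -> accumulated name and argument string
for chunk in stream:
    delta = chunk.choices[0].delta
    for tc in delta.tool_calls or []:
        slot = calls.setdefault(tc.index, {"name": "", "arguments": ""})
        if tc.function and tc.function.name:
            slot["name"] = tc.function.name
        if tc.function and tc.function.arguments:
            slot["arguments"] += tc.function.arguments  # JSON arrives in pieces

print(calls)  # e.g. {0: {"name": "get_weather", "arguments": '{"city": "Berlin"}'}}
```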