
feat: add streaming tool use

Open lsorber opened this issue 1 year ago • 7 comments

This PR upgrades the chatml-function-calling chat handler with support for streaming tool use and fixes #1883, #1869, and #1756, among other improvements.

Changes:

  1. General:
     a. ✨ If no system message is supplied, add an empty system message to hold the tool metadata.
     b. ✨ Add function descriptions to the system message so that tool use is better informed (fixes #1869).
     c. ✨ Replace print statements relating to JSON grammars with `RuntimeWarning` warnings.
     d. ✅ Add tests with fairly broad coverage of the different scenarios.
  2. Case "Tool choice by user":
     a. ✨ Add support for more than one function call by making this a special case of "Automatic tool choice" with a single tool (subsumes #1503).
  3. Case "Automatic tool choice -> respond with a message":
     a. ✨ Use the user-defined `stop` and `max_tokens`.
     b. 🐛 Replace incorrect use of the follow-up grammar with the user-defined grammar.
  4. Case "Automatic tool choice -> one or more function calls":
     a. ✨ Add support for streaming the function calls (fixes #1883).
     b. ✨ Make tool calling more robust by giving the LLM an explicit way to terminate the tool calls by wrapping them in a `<function_calls></function_calls>` block.
     c. 🐛 Add the missing `":"` stop token used to decide whether to continue with another tool call, which previously prevented parallel function calling (fixes #1756).
     d. ✨ Set `temperature=0` when deciding whether to continue with another tool call, similar to the initial decision on whether to call a tool.
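With streaming tool use (4a), clients receive OpenAI-style chunks whose `delta` may carry partial `tool_calls` fragments that must be accumulated into complete calls. The sketch below illustrates that accumulation pattern with synthetic chunks; the `get_weather` call and exact chunk payloads are illustrative, not taken from this PR (real chunks would come from `create_chat_completion(..., stream=True)`):

```python
# Sketch: accumulate OpenAI-style streamed tool-call deltas into full calls.
# Chunk payloads below are synthetic/illustrative.

def accumulate_tool_calls(chunks):
    """Merge per-chunk tool_call deltas into complete function calls."""
    calls = {}  # tool call index -> {"name": str, "arguments": str}
    for chunk in chunks:
        delta = chunk["choices"][0]["delta"]
        for tc in delta.get("tool_calls", []):
            entry = calls.setdefault(tc["index"], {"name": "", "arguments": ""})
            fn = tc.get("function", {})
            entry["name"] += fn.get("name", "")
            entry["arguments"] += fn.get("arguments", "")
    return [calls[i] for i in sorted(calls)]

# Synthetic chunks for a single hypothetical get_weather(city="Paris") call:
chunks = [
    {"choices": [{"delta": {"tool_calls": [
        {"index": 0, "function": {"name": "get_weather", "arguments": ""}}]}}]},
    {"choices": [{"delta": {"tool_calls": [
        {"index": 0, "function": {"arguments": '{"city": '}}]}}]},
    {"choices": [{"delta": {"tool_calls": [
        {"index": 0, "function": {"arguments": '"Paris"}'}}]}}]},
]
print(accumulate_tool_calls(chunks))
# → [{'name': 'get_weather', 'arguments': '{"city": "Paris"}'}]
```

Because the `arguments` string arrives in pieces, it is only guaranteed to be valid JSON once the stream for that tool call has finished.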

lsorber avatar Dec 25 '24 23:12 lsorber

@abetlen The tests all pass, but the macOS ones were terminated after a timeout. I think this is due to a lack of CPU and/or memory resources on the runner, since the tests run fine on my macOS machine.

lsorber avatar Dec 26 '24 20:12 lsorber

I would love to see this merged! There are actually quite a lot of good pull requests here that I would like to see merged, but this one is top priority!

SubatomicPlanets avatar Jan 04 '25 00:01 SubatomicPlanets

Update: I rebased on the latest main and included a few small tweaks to further improve tool calling robustness.

lsorber avatar Jan 05 '25 14:01 lsorber

Update: I rebased on the latest main and conditionally skipped the added tests on macOS when not enough resources are available to run them.
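A resource-gated skip of this kind can be expressed as a conditional test decorator. The sketch below uses only the standard library (`unittest` and a CPU-count heuristic); the `min_cpus` threshold, the `has_enough_resources` helper, and the macOS check are illustrative assumptions, not the PR's exact condition:

```python
# Sketch: skip heavy tests on under-resourced macOS runners.
# The threshold and helper name are illustrative, not from the PR.
import os
import platform
import unittest

def has_enough_resources(min_cpus: int = 4) -> bool:
    """Heuristic: on macOS, require a minimum CPU count; elsewhere, run."""
    if platform.system() != "Darwin":
        return True
    return (os.cpu_count() or 0) >= min_cpus

class StreamingToolUseTests(unittest.TestCase):
    @unittest.skipUnless(has_enough_resources(), "insufficient resources on macOS")
    def test_streaming_tool_use(self):
        pass  # heavy model-loading test body elided
```

The same condition plugs into `pytest.mark.skipif` if the suite uses pytest; skipping keeps CI green while still reporting the tests as skipped rather than silently dropping them.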

lsorber avatar Jan 12 '25 15:01 lsorber

This worked well for me. Would you mind rebasing onto the latest commit to allow tool streaming with Qwen models? Thanks for your work!

LenBanana avatar Jan 31 '25 13:01 LenBanana

Would love to see this merged - is there anything holding it up?

conornash avatar Mar 04 '25 11:03 conornash

@abetlen I rebased the PR on the latest upstream main and added a small commit to fix the returned logprobs format.

lsorber avatar Mar 14 '25 13:03 lsorber