[EVAL] Add tau-bench
Evaluation short description
Similar to BFCL in #873, tau-bench is a popular agentic benchmark that measures the ability of LLMs to use tools in real-world domains (e.g. booking flights). It is widely reported in modern model releases, including those from frontier labs.
We could focus on the more recent τ² variant.
Evaluation metadata
Provide all available
- Paper url: https://arxiv.org/abs/2406.12045
- Github url: https://github.com/sierra-research/tau-bench
- Dataset url: simulated
Hi, would it be okay if an external contributor worked on this task?
We'd be delighted! :)
Hi @yijun-lee just FYI that the tau2_bench data is stored directly in their GitHub repo, so I created a copy on the Hub that can be downloaded at runtime: https://huggingface.co/datasets/HuggingFaceH4/tau2-bench-data
Note that this is a very complex benchmark, and that for the vLLM backend one must configure the tool-calling and reasoning parsers that match the served model to get correct results:
- https://docs.vllm.ai/en/stable/features/tool_calling.html
- https://docs.vllm.ai/en/stable/features/reasoning_outputs.html
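For illustration, a server launch might look like the following. This is a sketch, not a verified configuration: the model name and the `hermes` / `deepseek_r1` parser choices are placeholders, and the right parsers depend on which model you serve (see the two docs pages above).

```shell
# Launch an OpenAI-compatible vLLM server with tool calling and
# reasoning parsing enabled. The model and parser names below are
# examples only -- pick the parsers that match your model.
vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-7B \
  --enable-auto-tool-choice \
  --tool-call-parser hermes \
  --reasoning-parser deepseek_r1
```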
You can see in my fork how I run both a local user agent and an assistant agent with vLLM: https://github.com/huggingface/tau2-bench/blob/trl-internal/run_tau2_local.sh
A potential implementation using inspect: https://github.com/groq/openbench/pull/294