lighteval

[EVAL] Add tau-bench

Open lewtun opened this issue 6 months ago • 4 comments

Evaluation short description

Similar to BFCL in #873, tau-bench is a popular agentic benchmark that measures the ability of LLMs to use tools in real-world domains (e.g. booking flights). It is widely reported in recent model releases, including those from frontier labs.

We could focus on the more recent τ^2 variant.

Evaluation metadata

Provide all available

  • Paper url: https://arxiv.org/abs/2406.12045
  • Github url: https://github.com/sierra-research/tau-bench
  • Dataset url: simulated

lewtun • Jul 24 '25 22:07

Hi, would it be okay if an external contributor worked on this task?

yijun-lee • Sep 08 '25 10:09

We'd be delighted! :)

clefourrier • Sep 08 '25 11:09

Hi @yijun-lee, just FYI: the tau2_bench data is stored directly in their GitHub repo, so I created a copy on the Hub that can be downloaded at runtime: https://huggingface.co/datasets/HuggingFaceH4/tau2-bench-data
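
For anyone picking this up, a minimal sketch of what "downloaded at runtime" could look like with `huggingface_hub`. Only the repo ID comes from this thread; the helper name `fetch_tau2_data` is hypothetical, and the sketch makes no assumption about the file layout inside the dataset repo.

```python
from pathlib import Path
from typing import Optional

# Dataset repo named in this thread; everything else here is illustrative.
REPO_ID = "HuggingFaceH4/tau2-bench-data"


def fetch_tau2_data(cache_dir: Optional[str] = None) -> Path:
    """Download (or reuse a cached copy of) the tau2-bench data from the Hub.

    Hypothetical helper, not part of lighteval or tau2-bench.
    """
    # Imported lazily so this module loads even without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    local_dir = snapshot_download(
        repo_id=REPO_ID,
        repo_type="dataset",  # the repo lives under /datasets/ on the Hub
        cache_dir=cache_dir,  # None falls back to the default HF cache
    )
    return Path(local_dir)
```

Calling `fetch_tau2_data()` returns the local snapshot directory; listing it with `sorted(p.name for p in fetch_tau2_data().iterdir())` is a safe way to inspect the contents before wiring up any loaders.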

Note that this is a very complex benchmark. For the vLLM backend, you must specify the tool-calling parser and reasoning parser that match the model being served to get correct results:

  • https://docs.vllm.ai/en/stable/features/tool_calling.html
  • https://docs.vllm.ai/en/stable/features/reasoning_outputs.html
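
As a rough sketch of what that means in practice, the snippet below assembles a `vllm serve` command with both parsers set. The flags `--enable-auto-tool-choice`, `--tool-call-parser`, and `--reasoning-parser` are described in the vLLM docs linked above; the helper name, the model, and the parser values are illustrative assumptions, so check the docs for the parsers that match your model and vLLM version.

```python
import shlex
from typing import List


def build_vllm_serve_command(
    model: str,
    tool_call_parser: str,
    reasoning_parser: str,
    port: int = 8000,
) -> List[str]:
    """Return the argv list for `vllm serve` with both parsers enabled.

    Hypothetical helper; it only builds the command and does not launch vLLM.
    """
    return [
        "vllm", "serve", model,
        "--port", str(port),
        # Let the server emit structured tool calls chosen by the model.
        "--enable-auto-tool-choice",
        "--tool-call-parser", tool_call_parser,
        # Split reasoning text from the final answer in the response.
        "--reasoning-parser", reasoning_parser,
    ]


# Illustrative values only: the right parsers depend on the model family.
cmd = build_vllm_serve_command(
    "Qwen/Qwen3-8B",
    tool_call_parser="hermes",
    reasoning_parser="deepseek_r1",
)
print(shlex.join(cmd))
```

Building the command as a list keeps it safe to pass to `subprocess.Popen(cmd)` without shell quoting issues, and makes it easy to serve the user and assistant models on different ports.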

You can see in my fork how I run both a local user agent and a local assistant agent with vLLM: https://github.com/huggingface/tau2-bench/blob/trl-internal/run_tau2_local.sh

lewtun • Sep 08 '25 14:09

Here is a potential implementation using inspect: https://github.com/groq/openbench/pull/294

xeophon • Nov 10 '25 16:11