[EVAL] Add tau-bench
Evaluation short description
Similar to BFCL in #873, tau-bench is a popular agentic benchmark that measures the ability of LLMs to use tools in real-world domains (e.g. booking flights). It is widely reported in modern model releases, including those from frontier labs.
We could focus on the more recent τ² variant.
Evaluation metadata
Provide all available
- Paper url: https://arxiv.org/abs/2406.12045
- Github url: https://github.com/sierra-research/tau-bench
- Dataset url: simulated
Hi, would it be okay if an external contributor worked on this task?
We'd be delighted! :)
Hi @yijun-lee just FYI that the tau2_bench data is stored directly in their GitHub repo, so I created a copy on the Hub that can be downloaded at runtime: https://huggingface.co/datasets/HuggingFaceH4/tau2-bench-data
Note that this is a very complex benchmark, and that for the vLLM backend one must configure the tool-calling and reasoning parsers that match the served model to get correct results:
- https://docs.vllm.ai/en/stable/features/tool_calling.html
- https://docs.vllm.ai/en/stable/features/reasoning_outputs.html
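For illustration, a server launch might look like the following. This is a sketch, not a verified configuration: the model name and the `hermes` / `deepseek_r1` parser choices are placeholders, and the right parsers depend on which model you serve (see the two docs pages above).

```shell
# Launch an OpenAI-compatible vLLM server with tool calling and
# reasoning parsing enabled. The model and parser names below are
# examples only -- pick the parsers that match your model.
vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-7B \
  --enable-auto-tool-choice \
  --tool-call-parser hermes \
  --reasoning-parser deepseek_r1
```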
You can see in my fork how I run both a local user agent and an assistant agent with vLLM: https://github.com/huggingface/tau2-bench/blob/trl-internal/run_tau2_local.sh
A potential implementation using inspect: https://github.com/groq/openbench/pull/294