Benchmark for tool-agent-user interaction research
τ-bench is a benchmark for evaluating tool-agent-user interactions in real-world domains such as airline and retail. It targets researchers and developers building tool-using AI agents, and offers a standardized way to measure performance across different agent strategies and user simulation methods. It provides quantitative performance metrics and error-analysis tools to support the development of more robust and capable agents.
How It Works
τ-bench simulates user interactions with AI agents that utilize external tools. It supports various agent strategies (e.g., tool-calling, ReAct) and user simulation strategies (e.g., LLM, ReAct, verify, reflection). The core mechanism involves running predefined tasks, observing agent-tool-user interactions, and evaluating success rates based on task completion. The benchmark includes an auto error identification tool that leverages LLMs to classify faults (e.g., wrong tool usage, incorrect arguments) and assign responsibility (user, agent, environment).
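The evaluation idea described above (score each task by completion, then classify failures by fault type and responsible party) can be sketched as follows. This is a minimal illustration, not τ-bench's actual code: `TaskResult`, `success_rate`, `fault_breakdown`, and the label strings are hypothetical names chosen for this example.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class TaskResult:
    """Outcome of one simulated task (hypothetical structure, not τ-bench's schema)."""
    task_id: int
    success: bool
    fault_type: Optional[str] = None    # e.g. "wrong_tool", "wrong_arguments"
    fault_author: Optional[str] = None  # "user", "agent", or "environment"


def success_rate(results: List[TaskResult]) -> float:
    """Fraction of tasks completed successfully; 0.0 for an empty run."""
    if not results:
        return 0.0
    return sum(r.success for r in results) / len(results)


def fault_breakdown(results: List[TaskResult]) -> Counter:
    """Count failed tasks by (responsible party, fault type)."""
    failed = [r for r in results if not r.success]
    return Counter((r.fault_author, r.fault_type) for r in failed)


if __name__ == "__main__":
    # Illustrative data only; these are not real benchmark results.
    demo = [
        TaskResult(0, True),
        TaskResult(1, False, "wrong_tool", "agent"),
        TaskResult(2, False, "wrong_arguments", "agent"),
        TaskResult(3, True),
    ]
    print(f"success rate: {success_rate(demo):.2f}")
    for (author, fault), count in fault_breakdown(demo).items():
        print(f"{author}/{fault}: {count}")
```

In the real benchmark the fault labels come from an LLM classifier, so a breakdown like this inherits whatever errors that classifier makes.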
Quick Start & Requirements
After cloning the repository, install the package:
pip install -e .
Then run a benchmark, for example:
python run.py --agent-strategy tool-calling --env retail --model gpt-4o --model-provider openai --user-model gpt-4o --user-model-provider openai --user-strategy llm --max-concurrency 10
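To repeat the run command across domains, it may help to build the argument list programmatically. The sketch below only assembles and prints commands and does not execute anything. `build_run_command` is a hypothetical helper, every flag is copied from the command above, and `airline` as an `--env` value is an assumption based on the domains named in the summary.

```python
import shlex
from typing import List


def build_run_command(
    env: str,
    agent_strategy: str = "tool-calling",
    model: str = "gpt-4o",
    provider: str = "openai",
    user_strategy: str = "llm",
    max_concurrency: int = 10,
) -> List[str]:
    """Assemble the argv for run.py using only flags shown in the quick start."""
    return [
        "python", "run.py",
        "--agent-strategy", agent_strategy,
        "--env", env,
        "--model", model,
        "--model-provider", provider,
        # This sketch reuses the agent's model and provider for the simulated user.
        "--user-model", model,
        "--user-model-provider", provider,
        "--user-strategy", user_strategy,
        "--max-concurrency", str(max_concurrency),
    ]


if __name__ == "__main__":
    # "airline" is assumed to be a valid --env value; check the repository.
    for env in ("retail", "airline"):
        print(shlex.join(build_run_command(env)))
```

Each list can be passed to `subprocess.run` from the repository root once the flag values have been checked against the repository's own documentation.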
Maintenance & Community
Issues and pull requests can be submitted directly to the repository. Contact information is available via the GitHub repository.
Licensing & Compatibility
The project is licensed under the MIT License, permitting commercial use and integration with closed-source projects.
Limitations & Caveats
The auto error identification feature relies on LLMs and may produce inaccurate results. The benchmark was recently rewritten, so historical results may need to be rerun if the data structure changes. Performance metrics for some models (e.g., Mistral-large, GPT-4o-mini) are marked as '??', indicating incomplete or unavailable data.