tau-bench by sierra-research

Benchmark for tool-agent-user interaction research

created 1 year ago
709 stars

Top 49.3% on sourcepulse

Project Summary

τ-bench is a benchmark for evaluating tool-agent-user interactions in real-world domains such as airline and retail. It targets researchers and developers building tool-using AI agents, offering a standardized way to compare agent strategies and user simulation methods. The goal is to support the development of more robust, capable agents through quantitative performance metrics and error-analysis tooling.

How It Works

τ-bench simulates conversations between a user and an AI agent that can call external tools. It supports several agent strategies (e.g., tool-calling, ReAct) and user simulation strategies (e.g., LLM, ReAct, verify, reflection). Each run executes predefined tasks, records the agent-tool-user interaction, and scores success by whether the task was completed. An auto error identification tool uses LLMs to classify faults in failed runs (e.g., wrong tool usage, incorrect arguments) and assign responsibility to the user, the agent, or the environment.
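
To make the loop concrete, below is a minimal, self-contained toy sketch of the episode structure described above. It is an illustration only, not the repo's actual API: every name in it (Action, ToyEnv, ToyAgent, ToyUser) is hypothetical, and in the real benchmark the agent and user simulator are backed by LLM calls.

    from dataclasses import dataclass

    # Self-contained toy version of the tool-agent-user loop; all names are
    # hypothetical stand-ins for LLM-backed components in the real benchmark.
    @dataclass
    class Action:
        is_tool_call: bool
        payload: str  # a tool invocation, or a natural-language reply

    class ToyEnv:
        def __init__(self):
            self.done = False
        def step(self, tool_call: str) -> str:
            # Executing the right tool moves the environment to its goal state.
            self.done = tool_call == "cancel_order(42)"
            return "order 42 cancelled" if self.done else "unknown tool"
        def evaluate(self) -> bool:
            return self.done  # success = goal state reached

    class ToyAgent:
        def act(self, user_msg: str) -> Action:
            # A real agent would prompt an LLM here (tool-calling, ReAct, ...).
            if "cancel" in user_msg:
                return Action(True, "cancel_order(42)")
            return Action(False, "How can I help you today?")

    class ToyUser:
        def message(self, agent_reply):
            # A real user simulator is itself an LLM with a persona and a goal.
            return "Please cancel order 42."

    env, agent, user = ToyEnv(), ToyAgent(), ToyUser()
    msg = user.message(None)
    for _ in range(10):  # capped turn budget
        action = agent.act(msg)
        if action.is_tool_call:
            env.step(action.payload)
        else:
            msg = user.message(action.payload)
        if env.done:
            break
    print("task success:", env.evaluate())

The key structural point is the alternation: on each turn the agent either acts on the environment via a tool call or replies to the simulated user, and success is judged by task completion in the environment rather than by the conversation text.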

Quick Start & Requirements

  • Install: pip install -e . after cloning the repository.
  • Prerequisites: API keys for OpenAI, Anthropic, Google, Mistral, or AnyScale must be set as environment variables (a pre-flight sketch follows this list).
  • Run: python run.py --agent-strategy tool-calling --env retail --model gpt-4o --model-provider openai --user-model gpt-4o --user-model-provider openai --user-strategy llm --max-concurrency 10
  • Paper: https://arxiv.org/abs/2406.12045
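
Before launching the run command above, a quick sanity check that the needed keys are exported can save a failed run. The snippet below is a hypothetical pre-flight check, not part of the repository; only OPENAI_API_KEY (the provider's conventional variable name) is assumed, so extend the list for other providers.

    import os

    # Hypothetical pre-flight check (not part of the repo): verify the provider
    # API keys are exported before launching a run. Only OPENAI_API_KEY is
    # assumed here; add e.g. ANTHROPIC_API_KEY for Anthropic-backed models.
    REQUIRED = ["OPENAI_API_KEY"]
    missing = [key for key in REQUIRED if not os.environ.get(key)]
    if missing:
        raise SystemExit(f"Set these environment variables first: {missing}")
    print("All required API keys are set.")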

Highlighted Details

  • Benchmarks performance of various LLMs (e.g., GPT-4o, Claude 3.5 Sonnet) on airline and retail tasks.
  • Supports multiple agent and user simulation strategies for comprehensive evaluation.
  • Includes an LLM-powered auto error identification tool for fault analysis (see the hypothetical sketch after this list).
  • Provides historical trajectories for faster iteration.
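
The sketch below shows how an LLM-powered fault classifier along these lines could be wired up. It is a hedged approximation, not the repository's auto error identification tool: the fault taxonomy, the classify_failure helper, and the prompt are all hypothetical; the only real API used is the official openai Python package (v1+), which reads OPENAI_API_KEY from the environment.

    import json
    from openai import OpenAI  # official openai package, v1+

    # Illustrative fault taxonomy; the repo's actual labels may differ.
    FAULT_TYPES = ["wrong_tool", "wrong_arguments", "goal_misunderstanding"]
    BLAME = ["user", "agent", "environment"]

    def classify_failure(transcript: str) -> dict:
        """Hypothetical helper: ask an LLM to label a failed trajectory."""
        client = OpenAI()  # reads OPENAI_API_KEY from the environment
        prompt = (
            "Given this failed agent-user transcript, answer in JSON with keys "
            f"'fault' (one of {FAULT_TYPES}) and 'blame' (one of {BLAME}).\n\n"
            + transcript
        )
        resp = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
            response_format={"type": "json_object"},
        )
        return json.loads(resp.choices[0].message.content)

Constraining the model to JSON output keeps the labels machine-readable, so classifications can be aggregated across many failed trajectories.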

Maintenance & Community

Issues and pull requests can be submitted directly on the GitHub repository, which also lists contact information.

Licensing & Compatibility

The project is licensed under the MIT License, permitting commercial use and integration with closed-source projects.

Limitations & Caveats

The auto error identification feature relies on LLMs and may itself produce inaccurate classifications. The benchmark code was recently rewritten, so historical trajectories may need to be rerun if the data structure changes. Performance figures for some models (e.g., Mistral-large, GPT-4o-mini) are marked '??', indicating incomplete or unavailable data.

Health Check

  • Last commit: 2 weeks ago
  • Responsiveness: 1 week
  • Pull requests (30d): 2
  • Issues (30d): 4
  • Star history: 258 stars in the last 90 days
