tau-bench by sierra-research

Benchmark for tool-agent-user interaction research

created 1 year ago
709 stars

Top 49.3% on sourcepulse

Project Summary

τ-bench is a benchmark for evaluating tool-agent-user interactions in real-world domains such as airline and retail. It targets researchers and developers building tool-using AI agents, offering a standardized way to compare agent strategies and user simulation methods. The goal is to support the development of more robust, capable agents through quantitative performance metrics and error-analysis tooling.

How It Works

τ-bench simulates conversations between a user and an AI agent that can call external tools. It supports several agent strategies (e.g., tool-calling, ReAct) and user simulation strategies (e.g., LLM, ReAct, verify, reflection). Each run executes predefined tasks, records the agent-tool-user interaction, and scores success by whether the task was completed. An auto error identification tool uses LLMs to classify faults in failed runs (e.g., wrong tool usage, incorrect arguments) and assign responsibility to the user, the agent, or the environment.
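
To make the loop concrete, below is a minimal, self-contained toy sketch of the episode structure described above. It is an illustration only, not the repo's actual API: every name in it (Action, ToyEnv, ToyAgent, ToyUser) is hypothetical, and in the real benchmark the agent and user simulator are backed by LLM calls.

    from dataclasses import dataclass

    # Self-contained toy version of the tool-agent-user loop; all names are
    # hypothetical stand-ins for LLM-backed components in the real benchmark.
    @dataclass
    class Action:
        is_tool_call: bool
        payload: str  # a tool invocation, or a natural-language reply

    class ToyEnv:
        def __init__(self):
            self.done = False
        def step(self, tool_call: str) -> str:
            # Executing the right tool moves the environment to its goal state.
            self.done = tool_call == "cancel_order(42)"
            return "order 42 cancelled" if self.done else "unknown tool"
        def evaluate(self) -> bool:
            return self.done  # success = goal state reached

    class ToyAgent:
        def act(self, user_msg: str) -> Action:
            # A real agent would prompt an LLM here (tool-calling, ReAct, ...).
            if "cancel" in user_msg:
                return Action(True, "cancel_order(42)")
            return Action(False, "How can I help you today?")

    class ToyUser:
        def message(self, agent_reply):
            # A real user simulator is itself an LLM with a persona and a goal.
            return "Please cancel order 42."

    env, agent, user = ToyEnv(), ToyAgent(), ToyUser()
    msg = user.message(None)
    for _ in range(10):  # capped turn budget
        action = agent.act(msg)
        if action.is_tool_call:
            env.step(action.payload)
        else:
            msg = user.message(action.payload)
        if env.done:
            break
    print("task success:", env.evaluate())

The key structural point is the alternation: on each turn the agent either acts on the environment via a tool call or replies to the simulated user, and success is judged by task completion in the environment rather than by the conversation text.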

Quick Start & Requirements

  • Install: pip install -e . after cloning the repository.
  • Prerequisites: API keys for OpenAI, Anthropic, Google, Mistral, or AnyScale must be set as environment variables (a pre-flight sketch follows this list).
  • Run: python run.py --agent-strategy tool-calling --env retail --model gpt-4o --model-provider openai --user-model gpt-4o --user-model-provider openai --user-strategy llm --max-concurrency 10
  • Paper: https://arxiv.org/abs/2406.12045
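
Before launching the run command above, a quick sanity check that the needed keys are exported can save a failed run. The snippet below is a hypothetical pre-flight check, not part of the repository; only OPENAI_API_KEY (the provider's conventional variable name) is assumed, so extend the list for other providers.

    import os

    # Hypothetical pre-flight check (not part of the repo): verify the provider
    # API keys are exported before launching a run. Only OPENAI_API_KEY is
    # assumed here; add e.g. ANTHROPIC_API_KEY for Anthropic-backed models.
    REQUIRED = ["OPENAI_API_KEY"]
    missing = [key for key in REQUIRED if not os.environ.get(key)]
    if missing:
        raise SystemExit(f"Set these environment variables first: {missing}")
    print("All required API keys are set.")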

Highlighted Details

  • Benchmarks performance of various LLMs (e.g., GPT-4o, Claude 3.5 Sonnet) on airline and retail tasks.
  • Supports multiple agent and user simulation strategies for comprehensive evaluation.
  • Includes an LLM-powered auto error identification tool for fault analysis (see the hypothetical sketch after this list).
  • Provides historical trajectories for faster iteration.
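
The sketch below shows how an LLM-powered fault classifier along these lines could be wired up. It is a hedged approximation, not the repository's auto error identification tool: the fault taxonomy, the classify_failure helper, and the prompt are all hypothetical; the only real API used is the official openai Python package (v1+), which reads OPENAI_API_KEY from the environment.

    import json
    from openai import OpenAI  # official openai package, v1+

    # Illustrative fault taxonomy; the repo's actual labels may differ.
    FAULT_TYPES = ["wrong_tool", "wrong_arguments", "goal_misunderstanding"]
    BLAME = ["user", "agent", "environment"]

    def classify_failure(transcript: str) -> dict:
        """Hypothetical helper: ask an LLM to label a failed trajectory."""
        client = OpenAI()  # reads OPENAI_API_KEY from the environment
        prompt = (
            "Given this failed agent-user transcript, answer in JSON with keys "
            f"'fault' (one of {FAULT_TYPES}) and 'blame' (one of {BLAME}).\n\n"
            + transcript
        )
        resp = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
            response_format={"type": "json_object"},
        )
        return json.loads(resp.choices[0].message.content)

Constraining the model to JSON output keeps the labels machine-readable, so classifications can be aggregated across many failed trajectories.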

Maintenance & Community

Issues and pull requests can be submitted directly on the GitHub repository, which also lists contact information.

Licensing & Compatibility

The project is licensed under the MIT License, permitting commercial use and integration with closed-source projects.

Limitations & Caveats

The auto error identification feature relies on LLMs and may itself produce inaccurate classifications. The benchmark code was recently rewritten, so historical trajectories may need to be rerun if the data structure changes. Performance figures for some models (e.g., Mistral-large, GPT-4o-mini) are marked '??', indicating incomplete or unavailable data.

Health Check

  • Last commit: 2 weeks ago
  • Responsiveness: 1 week
  • Pull requests (30d): 2
  • Issues (30d): 4
  • Star history: 258 stars in the last 90 days
