tau2-bench  by sierra-research

Framework for evaluating conversational agents in dual-control environments

Created 3 months ago
301 stars

Top 88.6% on SourcePulse

GitHubView on GitHub
Project Summary

τ²-Bench is a simulation framework designed for evaluating customer service conversational agents across various domains like airline, retail, and telecom. It provides a dual-control environment where both the agent and a user simulator interact, allowing for rigorous performance assessment. This framework benefits agent developers by offering a standardized method to test and benchmark their agents' capabilities in realistic, simulated scenarios.

How It Works

τ²-Bench operates by defining specific policies, tools, and tasks for each domain. An orchestrator manages the conversation flow, passing messages between the agent, a user simulator, and the environment. The agent can utilize a set of provided tools to interact with the environment, while the user simulator mimics real user behavior. This setup enables the evaluation of agent performance based on adherence to policies and task completion success.

Quick Start & Requirements

  • Primary install: pip install -e . (after cloning the repository)
  • Requirements: Python 3.10 or higher. LiteLLM is used for LLM API management, requiring API keys to be configured in a .env file.
  • Verification: Run tau2 check-data to ensure data directory setup.
  • Documentation: Domain-specific API docs are available via tau2 domain <domain> and visiting http://127.0.0.1:8004/redoc.

Highlighted Details

  • Supports multiple domains (mock, airline, retail, telecom) with customizable policies and tasks.
  • Includes an Environment CLI (beta) for interactive querying and testing of domain tools.
  • Offers modes for ablation studies, such as agent-llm solo or with an oracle plan (llm_agent_gt).
  • LLM call caching can be enabled by configuring Redis and setting LLM_CACHE_ENABLED to True.

Maintenance & Community

  • The project is hosted on GitHub at https://github.com/sierra-research/tau2-bench.
  • Citation details are available in BibTeX format, referencing arXiv:2506.07982.

Licensing & Compatibility

  • The repository does not explicitly state a license in the provided README. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

  • The Environment CLI is noted as being in beta.
  • The README does not specify a license, which may impact adoption and compatibility decisions.
Health Check
Last Commit

3 weeks ago

Responsiveness

Inactive

Pull Requests (30d)
24
Issues (30d)
5
Star History
103 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen Chip Huyen(Author of "AI Engineering", "Designing Machine Learning Systems"), Elvis Saravia Elvis Saravia(Founder of DAIR.AI), and
1 more.

TinyTroupe by microsoft

0.2%
7k
LLM-powered multiagent simulation for business insights and imagination
Created 1 year ago
Updated 3 weeks ago
Starred by Andrej Karpathy Andrej Karpathy(Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Magnus Müller Magnus Müller(Cofounder of Browser Use), and
83 more.

langchain by langchain-ai

0.4%
116k
Framework for building LLM-powered applications
Created 2 years ago
Updated 1 day ago
Feedback? Help us improve.