tau2-bench by sierra-research

Framework for evaluating conversational agents in dual-control environments

Created 8 months ago

769 stars

Top 45.4% on SourcePulse

4 Experts Love This Project

shizhediao

Author of LMFlow; Research Scientist at NVIDIA

vincentweisser

Vincent Weisser

Cofounder of Prime Intellect

lewtun

Research Engineer at Hugging Face

t3dotgg

Founder of Ping.gg

Project Summary

τ²-Bench is a simulation framework for evaluating customer service conversational agents in domains such as airline, retail, and telecom. It provides a dual-control environment in which both the agent and a user simulator take part in the conversation and can act on the shared environment, so an agent is assessed on how well it works with a user rather than in isolation. This gives agent developers a standardized way to test and benchmark agents in realistic, simulated scenarios.

How It Works

τ²-Bench operates by defining specific policies, tools, and tasks for each domain. An orchestrator manages the conversation flow, passing messages between the agent, a user simulator, and the environment. The agent can utilize a set of provided tools to interact with the environment, while the user simulator mimics real user behavior. This setup enables the evaluation of agent performance based on adherence to policies and task completion success.
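The message flow described above can be illustrated with a small, self-contained Python sketch. It is not tau2-bench code: every name in it (Environment, scripted_user, scripted_agent, run_episode, the STOP marker, the cancel_booking tool) is hypothetical and chosen only for illustration, and for brevity only the agent calls tools here. It shows an orchestrator alternating turns, routing tool calls to the environment, and a final check of task success against the environment state.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# NOTE: all names here are hypothetical; none are taken from tau2-bench.
STOP = "###STOP###"            # assumed marker the user sends to end the episode
Message = Tuple[str, str]      # (role, content)


@dataclass
class Environment:
    """Toy environment: a small database plus the tools that may modify it."""
    db: Dict[str, str] = field(default_factory=lambda: {"booking_42": "active"})

    def call_tool(self, name: str, arg: str) -> str:
        if name == "cancel_booking" and arg in self.db:
            self.db[arg] = "cancelled"
            return f"{arg} cancelled"
        return f"error: cannot run {name}({arg})"


def scripted_user(history: List[Message]) -> str:
    """Stand-in for a user simulator: makes one request, then stops."""
    return "Please cancel booking_42" if not history else STOP


def scripted_agent(history: List[Message]) -> Tuple[str, ...]:
    """Stand-in for an agent: calls a tool on a user request, then reports back."""
    role, content = history[-1]
    if role == "user":
        return ("tool", "cancel_booking", content.split()[-1])
    return ("say", f"Done: {content}")


def run_episode(env: Environment, agent: Callable, user: Callable,
                max_steps: int = 10) -> List[Message]:
    """Orchestrator: passes messages between user, agent, and environment."""
    history: List[Message] = []
    speaker = "user"
    for _ in range(max_steps):
        if speaker == "user":
            text = user(history)
            history.append(("user", text))
            if text == STOP:
                break
            speaker = "agent"
        else:
            action = agent(history)
            if action[0] == "tool":
                # Tool calls go to the environment; the agent then acts again.
                history.append(("tool", env.call_tool(action[1], action[2])))
            else:
                history.append(("agent", action[1]))
                speaker = "user"
    return history


env = Environment()
transcript = run_episode(env, scripted_agent, scripted_user)
# Task success is judged here by comparing the final environment state
# with the expected one.
task_success = env.db == {"booking_42": "cancelled"}
for role, content in transcript:
    print(f"{role}: {content}")
print("task success:", task_success)
```

In a real run the scripted functions would be replaced by LLM-backed components, and policy adherence would be checked in addition to the final state.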

Quick Start & Requirements

Primary install: pip install -e . (after cloning the repository)
Requirements: Python 3.10 or higher. LiteLLM is used for LLM API management, requiring API keys to be configured in a .env file.
Verification: Run tau2 check-data to ensure data directory setup.
Documentation: View domain-specific API docs by running tau2 domain <domain> and then visiting http://127.0.0.1:8004/redoc.
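Before a first run it can help to confirm the two requirements listed above. The following Python sketch is not part of tau2-bench; the helper names (parse_env_file, preflight) are hypothetical, and the key OPENAI_API_KEY is only an assumed example, since the keys actually needed depend on which LLM provider is configured through LiteLLM.

```python
import sys
from pathlib import Path
from typing import Dict, List, Sequence, Tuple


def parse_env_file(text: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def preflight(env_text: str, required_keys: Sequence[str],
              version: Tuple[int, int] = sys.version_info[:2]) -> List[str]:
    """Return a list of human-readable problems; an empty list means ready."""
    problems: List[str] = []
    if version < (3, 10):
        problems.append(f"Python 3.10+ required, found {version[0]}.{version[1]}")
    values = parse_env_file(env_text)
    for key in required_keys:
        if not values.get(key):
            problems.append(f"{key} is missing or empty in .env")
    return problems


if __name__ == "__main__":
    env_path = Path(".env")
    env_text = env_path.read_text() if env_path.exists() else ""
    # OPENAI_API_KEY is an assumed example; use the keys for your provider.
    for problem in preflight(env_text, ["OPENAI_API_KEY"]):
        print("problem:", problem)
```

This check only covers the interpreter version and the .env file; tau2 check-data remains the way to verify the data directory.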

Highlighted Details

Supports multiple domains (mock, airline, retail, telecom) with customizable policies and tasks.
Includes an Environment CLI (beta) for interactive querying and testing of domain tools.
Offers modes for ablation studies, such as running the agent LLM solo or with an oracle plan (llm_agent_gt).
LLM call caching can be enabled by configuring Redis and setting LLM_CACHE_ENABLED to True.

Maintenance & Community

The project is hosted on GitHub at https://github.com/sierra-research/tau2-bench.
Citation details are available in BibTeX format, referencing arXiv:2506.07982.

Licensing & Compatibility

The repository does not explicitly state a license in the provided README. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The Environment CLI is noted as being in beta.
The README does not specify a license, which may impact adoption and compatibility decisions.

Health Check

Last Commit

2 weeks ago

Responsiveness

Inactive

Pull Requests (30d)

11

Issues (30d)

8

Star History

88 stars in the last 30 days

Explore Similar Projects

SimAIWorld by Turing-Project

Simulated worlds with autonomous AI agents

Created 2 years ago

Updated 2 years ago

Starred by

Enes Akar (Cofounder of Upstash).

Tiger by Upsonic

Tool ecosystem for LLM agents, integrating with LangChain, AutoGen, and CrewAI

Created 1 year ago

Updated 1 year ago

agent-evaluation by awslabs

Framework for testing generative AI virtual agents

Created 1 year ago

Updated 2 months ago

Starred by

Gabriel Almeida (Cofounder of Langflow).

scenario by langwatch

Agent testing framework for simulating user interactions

Created 10 months ago

Updated 1 day ago

langgraph-101 by langchain-ai

Framework for building complex agent and multi-agent applications

Created 1 year ago

Updated 1 week ago

Starred by

Vincent Weisser (Cofounder of Prime Intellect) and Binyuan Hui (Research Scientist at Alibaba Qwen).

AgentSims by py499372727

Open-source sandbox for LLM evaluation via task-based agent simulations

Created 2 years ago

Updated 2 years ago

DeepMCPAgent by cryxnet

Model-agnostic agents discover and use HTTP/SSE tools

Created 6 months ago

Updated 4 months ago

Starred by

Elvis Saravia (Founder of DAIR.AI) and Shyamal Anadkat (Research Scientist at OpenAI).

awesome-llm-agents by kaushikb11

Curated list of LLM agent frameworks

Created 2 years ago

Updated 2 days ago

rogue by qualifire-dev

AI agent evaluation framework

Created 8 months ago

Updated 1 day ago

Starred by

Elvis Saravia (Founder of DAIR.AI), Vincent Weisser (Cofounder of Prime Intellect), and 1 more.

AgentVerse by OpenBMB

Multi-agent framework for LLM deployment in task-solving/simulation

Created 2 years ago

Updated 1 year ago

Starred by

Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Elvis Saravia (Founder of DAIR.AI), and 1 more.

TinyTroupe by microsoft

LLM-powered multiagent simulation for business insights and imagination

Created 1 year ago

Updated 4 days ago

Starred by

Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Magnus Müller (Cofounder of Browser Use), and 86 more.

langchain by langchain-ai

Framework for building LLM-powered applications

Created 3 years ago

Updated 22 hours ago
