terminal-bench-2 by harbor-framework

AI agent benchmark for terminal environments

Created 10 months ago

340 stars

Top 80.8% on SourcePulse

Project Summary

Terminal-Bench 2.0 (TB 2.0) is a benchmark suite that measures how well AI agents and language models perform valuable work inside containerized terminal environments. As models grow more capable, it introduces harder, more realistic tasks, such as protein assembly and security vulnerability resolution, to evaluate frontier AI capabilities. Researchers and labs get a standardized, high-quality evaluation framework for agent performance.

How It Works

TB 2.0 leverages the new Harbor framework, a complete rewrite of the original evaluation harness, engineered for enhanced reliability, observability, scalability, and performance. The benchmark's core approach involves executing tasks within Docker containers, with TB 2.0 featuring tasks that have undergone extensive human and LM-assisted validation to ensure they are solvable, realistic, and well-specified. This focus on task quality aims to push the frontier of agent evaluation.

Quick Start & Requirements

Install Harbor with either uv tool install harbor or pip install harbor. To run the benchmark with the oracle agent, use uv run harbor run --dataset terminal-bench@2.0 --agent oracle --n-concurrent 4; Harbor downloads the tasks automatically. Running other agents or models, such as anthropic/claude-opus-4-1, requires setting the relevant API keys first (e.g., export ANTHROPIC_API_KEY=<YOUR-KEY>). Task details are available at https://github.com/laude-institute/terminal-bench-2.
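
As a rough illustration of the run step, the sketch below assembles the quoted command from Python and checks that an API key is set before launching. It uses only the flags that appear in the command above (--dataset, --agent, --n-concurrent); the helper names build_harbor_command and missing_api_keys are hypothetical and are not part of Harbor.

```python
import os
import shlex


def build_harbor_command(dataset="terminal-bench@2.0", agent="oracle", n_concurrent=4):
    """Build the argv list for the run command quoted in the Quick Start.

    Hypothetical helper: only the flags shown in the README summary are used.
    """
    if n_concurrent < 1:
        raise ValueError("n_concurrent must be at least 1")
    return [
        "uv", "run", "harbor", "run",
        "--dataset", dataset,
        "--agent", agent,
        "--n-concurrent", str(n_concurrent),
    ]


def missing_api_keys(required, environ=None):
    """Return the names in `required` that are unset or empty.

    `environ` defaults to the process environment; pass a dict to test.
    """
    env = os.environ if environ is None else environ
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    cmd = build_harbor_command()
    # Print the command as a shell-safe string for inspection.
    print(shlex.join(cmd))
    # The oracle agent is shown without an API key in the summary; a
    # model-backed agent would need one, e.g. ANTHROPIC_API_KEY.
    print("missing keys:", missing_api_keys(["ANTHROPIC_API_KEY"]))
    # To actually launch (requires uv, Harbor, and Docker), pass `cmd`
    # to subprocess.run(cmd, check=True).
```

Building the command as a list rather than a single string avoids shell-quoting problems if it is later handed to subprocess.run.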

Highlighted Details

Features harder, more realistic tasks than its predecessor, designed to test frontier AI capabilities.
Tasks are rigorously validated through human and LM-assisted processes for quality, solvability, and clear specification.
Employs the new Harbor framework for improved evaluation reliability, observability, scalability, and performance.
Supports integration with various third-party agents and models.

Maintenance & Community

For questions and support, the project directs users to the #tb-2 channel of its Discord server at https://discord.gg/6xWPKhGDbA.

Licensing & Compatibility

The provided README does not specify the software license or any compatibility notes for commercial use or closed-source linking.

Limitations & Caveats

The README describes Harbor as a new release slated for "general consumption later this month," which suggests it was in an early access or beta phase when the README was written. The project also implies a migration away from the legacy Terminal-Bench harness, which may be deprecated.
