terminal-bench-2  by harbor-framework

AI agent benchmark for terminal environments

Created 8 months ago
282 stars

Top 92.5% on SourcePulse

Project Summary

Terminal-Bench 2.0 (TB 2.0) is a benchmark suite that measures how well AI agents and language models perform valuable work inside containerized terminal environments. To keep pace with increasingly capable models, it introduces harder, more realistic tasks than its predecessor, such as protein assembly and security vulnerability resolution. It gives researchers and labs a standardized, high-quality framework for evaluating agent performance.

How It Works

TB 2.0 runs on the new Harbor framework, a complete rewrite of the original evaluation harness built for better reliability, observability, scalability, and performance. Each task executes inside a Docker container. The tasks have gone through extensive human and LM-assisted validation to ensure they are solvable, realistic, and well-specified, and this emphasis on task quality is intended to push the frontier of agent evaluation.

Quick Start & Requirements

Install Harbor with either uv tool install harbor or pip install harbor. To run the benchmark with the oracle agent, use uv run harbor run --dataset terminal-bench@2.0 --agent oracle --n-concurrent 4; this command downloads the tasks automatically. Running other agents or models, such as anthropic/claude-opus-4-1, requires setting the relevant API keys first (e.g., export ANTHROPIC_API_KEY=<YOUR-KEY>). Task details are available at https://github.com/laude-institute/terminal-bench-2.
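
The quick-start command can be wrapped in a small script. This is a minimal sketch, not part of Harbor: the function names harbor_cmd and check_anthropic_key and the DRY_RUN variable are hypothetical helpers introduced for illustration. Only the harbor arguments and the ANTHROPIC_API_KEY variable name come from the text above.

```shell
#!/bin/sh
# Sketch only: this wrapper and DRY_RUN are hypothetical conveniences.
# The harbor arguments are copied from the quick-start command above.
set -eu

# Print the benchmark command for a given agent ($1) and concurrency ($2).
harbor_cmd() {
  echo "uv run harbor run --dataset terminal-bench@2.0 --agent $1 --n-concurrent $2"
}

# Return non-zero (with a message on stderr) when ANTHROPIC_API_KEY is
# missing. Intended to be called before runs that use Anthropic models.
check_anthropic_key() {
  if [ -z "${ANTHROPIC_API_KEY:-}" ]; then
    echo "ANTHROPIC_API_KEY is not set" >&2
    return 1
  fi
}

cmd="$(harbor_cmd oracle 4)"
if [ "${DRY_RUN:-1}" = "1" ]; then
  echo "$cmd"   # default: only show the command, do not run it
else
  $cmd          # intentionally unquoted so the string splits into arguments
fi
```

By default the script only prints the command. Setting DRY_RUN=0 executes it, which assumes Harbor is installed and Docker is available.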

Highlighted Details

  • Features harder, more realistic tasks than its predecessor, designed to test frontier AI capabilities.
  • Tasks are rigorously validated through human and LM-assisted processes for quality, solvability, and clear specification.
  • Employs the new Harbor framework for improved evaluation reliability, observability, scalability, and performance.
  • Supports integration with various third-party agents and models.

Maintenance & Community

Questions and support are handled on the project's Discord server at https://discord.gg/6xWPKhGDbA, in the #tb-2 channel.

Licensing & Compatibility

The provided README does not specify the software license or any compatibility notes for commercial use or closed-source linking.

Limitations & Caveats

The Harbor framework is described as a new release, slated for "general consumption later this month," suggesting it may be in an early access or beta phase. The project also implies a migration from a legacy Terminal-Bench harness, which may be deprecated.

Health Check

  • Last commit: 1 month ago
  • Responsiveness: Inactive
  • Pull requests (30d): 1
  • Issues (30d): 1
  • Star history: 51 stars in the last 30 days
