harbor-framework
AI agent benchmark for terminal environments
Terminal-Bench 2.0 (TB 2.0) is a benchmark suite designed to measure the capabilities of AI agents and language models in performing valuable work within containerized terminal environments. It addresses the increasing sophistication of AI by introducing harder, more realistic tasks, such as protein assembly and security vulnerability resolution, to evaluate frontier AI capabilities. The project benefits researchers and labs by providing a standardized, high-quality evaluation framework for agent performance.
How It Works
TB 2.0 leverages the new Harbor framework, a complete rewrite of the original evaluation harness, engineered for enhanced reliability, observability, scalability, and performance. The benchmark's core approach involves executing tasks within Docker containers, with TB 2.0 featuring tasks that have undergone extensive human and LM-assisted validation to ensure they are solvable, realistic, and well-specified. This focus on task quality aims to push the frontier of agent evaluation.
Quick Start & Requirements
Installation is straightforward via Harbor: uv tool install harbor or pip install harbor. To run the benchmark, use uv run harbor run --dataset terminal-bench@2.0 --agent oracle --n-concurrent 4. This command automatically downloads tasks. Running specific agents or models, such as anthropic/claude-opus-4-1, requires setting relevant API keys (e.g., export ANTHROPIC_API_KEY=<YOUR-KEY>). Task details are available at https://github.com/laude-institute/terminal-bench-2.
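The run command above can be wrapped in a small launcher that checks for required API keys before starting. The following is a minimal sketch, not part of Harbor: the helper names build_harbor_command, missing_env, and run are hypothetical, and the only CLI flags used are the ones quoted above (--dataset, --agent, --n-concurrent). It defaults to a dry run that only prints the command.

```python
import os
import shlex
import shutil
import subprocess


def build_harbor_command(dataset="terminal-bench@2.0", agent="oracle", n_concurrent=4):
    """Assemble the argv for the run command quoted in the text.

    Hypothetical helper: it only uses the flags shown in the source
    (--dataset, --agent, --n-concurrent) and invents no others.
    """
    if n_concurrent < 1:
        raise ValueError("n_concurrent must be at least 1")
    return [
        "uv", "run", "harbor", "run",
        "--dataset", dataset,
        "--agent", agent,
        "--n-concurrent", str(n_concurrent),
    ]


def missing_env(required, environ=None):
    """Return the names in `required` that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [name for name in required if not environ.get(name)]


def run(argv, required_env=(), dry_run=True):
    """Print the command and, unless dry_run is set, execute it."""
    missing = missing_env(required_env)
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    print(shlex.join(argv))
    if dry_run:
        return 0
    if shutil.which(argv[0]) is None:
        raise FileNotFoundError(argv[0] + " was not found on PATH")
    return subprocess.run(argv, check=False).returncode


if __name__ == "__main__":
    # Dry run: prints the command without starting the benchmark.
    run(build_harbor_command(), dry_run=True)
```

When running an agent that calls a hosted model, pass the key names it needs, for example run(argv, required_env=["ANTHROPIC_API_KEY"], dry_run=False), so a missing key is reported before any tasks are downloaded.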
Maintenance & Community
For questions and support, use the project's Discord server at https://discord.gg/6xWPKhGDbA, specifically the #tb-2 channel.
Licensing & Compatibility
The provided README does not specify the software license or any compatibility notes for commercial use or closed-source linking.
Limitations & Caveats
The Harbor framework is described as a new release, slated for "general consumption later this month," suggesting it may be in an early access or beta phase. The project also implies a migration from a legacy Terminal-Bench harness, which may be deprecated.