Benchmark for LLM agents in real terminal environments
Terminal-Bench is a benchmark for evaluating AI agents on complex, real-world terminal tasks. It is aimed at developers who build LLM agents or benchmarking frameworks, or who want to test system-level reasoning, and it offers a reproducible suite for practical, end-to-end performance assessment.
How It Works
The project comprises a dataset of tasks and an execution harness. Each task pairs an English instruction with a verification script and an oracle solution. The harness connects a language model to a sandboxed terminal environment, so agents can be run autonomously and evaluated end to end on text-based interactions.
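As a rough sketch of what that means in practice, a single task directory might be laid out as follows. The file names here are illustrative assumptions, not the project's canonical layout; consult the repository's task gallery for the authoritative structure.

tasks/example-task/
  task.yaml      # English instruction plus task metadata read by the harness (assumed name)
  Dockerfile     # defines the sandboxed terminal environment the agent works inside
  solution.sh    # oracle solution used to confirm the task is solvable (assumed name)
  tests/         # verification script(s) the harness runs to decide pass/fail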
Quick Start & Requirements
pip install terminal-bench
# or
uv tool install terminal-bench
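After installation, a run is typically launched through the harness's command-line entry point. The invocation below is a hedged sketch: the placeholder agent, model, and dataset values are assumptions, and flag names may differ between releases, so check the CLI's built-in help or the project README for the current interface.

tb run --dataset terminal-bench-core --agent <agent-name> --model <provider/model-name>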
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The project is currently in beta, with ongoing expansion of its task dataset.