terminal-bench by laude-institute

Benchmark for LLM agents in real terminal environments

Created 7 months ago · 393 stars · Top 73.1% on SourcePulse

Project Summary

Terminal-Bench provides a benchmark for evaluating AI agents on complex, real-world terminal tasks. It targets developers building LLM agents, benchmarking frameworks, or testing system-level reasoning, offering a reproducible suite for practical, end-to-end performance assessment.

How It Works

The project comprises a dataset of tasks and an execution harness. Each task pairs an English-language instruction with a verification script and an oracle (reference) solution. The harness connects a language model to a sandboxed terminal environment, lets the agent work autonomously through text-based interaction, and then evaluates the outcome with the task's verification script.
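
To make the task anatomy concrete, here is a minimal sketch of what a verification script could look like, assuming a pytest-style checker and a made-up log-filtering task; the file paths, test names, and checks are illustrative assumptions, not taken from the actual dataset.

```python
# Illustrative only: NOT a script from the Terminal-Bench dataset.
# Hypothetical task: "extract every ERROR line from /app/server.log
# into /app/errors.txt". The checker runs after the agent finishes and
# asserts on the resulting filesystem state, not the command transcript.
from pathlib import Path

OUTPUT = Path("/app/errors.txt")  # hypothetical path for this sketch

def test_output_file_exists():
    assert OUTPUT.is_file(), "agent never produced the requested file"

def test_output_contains_only_error_lines():
    lines = OUTPUT.read_text().splitlines()
    assert lines, "expected at least one extracted line"
    assert all("ERROR" in line for line in lines)
```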

Quick Start & Requirements
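
Per the upstream README, the harness ships as a Python package: `pip install terminal-bench` (or `uv tool install terminal-bench`) provides the `tb` command-line entry point, and Docker is required because each task runs inside an isolated container. Consult the repository README for exact invocations and the list of supported agents and models.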

Highlighted Details

  • Beta release with ~100 tasks; more are planned.
  • Leaderboard available for submitting agent evaluations.
  • Supports custom task creation and contribution (see the scaffolding sketch after this list).
  • Includes a BibTeX citation for academic use.
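
Since custom tasks are supported, a contributor's first step is laying out the task directory. The following is a hedged sketch of a scaffolding helper; the file names (`instruction.md`, `solution.sh`, `tests/test_outputs.py`) are assumptions inferred from the task anatomy described above, so check the repository's contributing guide for the authoritative layout.

```python
# Hedged sketch: scaffolds a new task directory using a layout inferred
# from the documented task anatomy (instruction + verification script +
# oracle solution). All file names here are assumptions.
from pathlib import Path

def scaffold_task(root: Path, name: str, instruction: str) -> Path:
    task_dir = root / name
    (task_dir / "tests").mkdir(parents=True, exist_ok=True)
    # English instruction shown to the agent.
    (task_dir / "instruction.md").write_text(instruction + "\n")
    # Oracle solution: a known-good command sequence that solves the task.
    (task_dir / "solution.sh").write_text("#!/bin/bash\n# TODO: oracle steps\n")
    # Verification script: asserts on the end state, not the transcript.
    (task_dir / "tests" / "test_outputs.py").write_text(
        "def test_placeholder():\n"
        "    assert True  # replace with real checks\n"
    )
    return task_dir

if __name__ == "__main__":
    scaffold_task(Path("tasks"), "hello-task",
                  "Create /app/hello.txt containing 'hi'.")
```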

Maintenance & Community

Licensing & Compatibility

  • License: Apache 2.0.
  • Compatible with commercial and closed-source applications.

Limitations & Caveats

The project is currently in beta: the task set (~100 tasks) is still expanding, so scores may not be directly comparable across dataset versions.

Health Check

  • Last commit: 21 hours ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 92
  • Issues (30d): 26
  • Star history: 145 stars in the last 30 days
