terminal-bench by laude-institute

Benchmark for LLM agents in real terminal environments

Created 10 months ago
1,137 stars

Top 33.7% on SourcePulse

Project Summary

Terminal-Bench provides a benchmark for evaluating AI agents on complex, real-world terminal tasks. It targets developers who build LLM agents, work on benchmarking frameworks, or test system-level reasoning, and it offers a reproducible suite for practical, end-to-end performance assessment.

How It Works

The project comprises a dataset of tasks and an execution harness. Each task includes an English instruction, a verification script, and an oracle solution. The harness connects language models to a sandboxed terminal environment, enabling autonomous execution and evaluation of agent capabilities in text-based interactions.
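
To make the execution flow concrete, here is a minimal Python sketch of such a harness loop. This is illustrative only, not Terminal-Bench's actual API: the model call is stubbed out, and the real harness runs commands inside the sandboxed environment rather than on the host.

    import subprocess

    def query_model(instruction: str, transcript: str) -> str:
        """Stub for the LLM call; a real harness would prompt a model with
        the task instruction and the terminal transcript so far."""
        return "echo done"  # placeholder command

    def run_task(instruction: str, verify_script: str, max_steps: int = 10) -> bool:
        transcript = ""
        for _ in range(max_steps):
            command = query_model(instruction, transcript)
            result = subprocess.run(command, shell=True, capture_output=True,
                                    text=True, timeout=60)
            transcript += f"$ {command}\n{result.stdout}{result.stderr}"
        # The task's verification script, not the agent, decides success.
        check = subprocess.run(["bash", verify_script], capture_output=True)
        return check.returncode == 0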

Quick Start & Requirements
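
The repository's README is the authoritative source for setup. Broadly, the harness is distributed as a Python package with a command-line interface, and tasks run inside Docker containers, so a working Docker installation is required. A minimal sketch follows; the package name, CLI name, and flags below are assumptions, so defer to the upstream docs:

    pip install terminal-bench                 # assumed PyPI package name
    tb run --agent <agent> --task-id <task>    # assumed CLI invocation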

Highlighted Details

  • Beta release with ~100 tasks and plans for expansion.
  • Leaderboard available for submitting agent evaluations.
  • Supports custom task creation and contribution (see the layout sketch after this list).
  • Includes a BibTeX citation for academic use.
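
For task authors, each task bundles the three pieces described under How It Works. The layout below is an illustrative sketch; the exact file names are assumptions, and the repository's contribution guide is authoritative:

    my-task/
    ├── task.yaml       # English instruction and task metadata
    ├── Dockerfile      # sandboxed environment the task runs in
    ├── solution.sh     # oracle solution demonstrating one correct path
    └── tests/          # verification script that checks the end state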

Maintenance & Community

Licensing & Compatibility

  • License: Apache 2.0.
  • Compatible with commercial and closed-source applications.

Limitations & Caveats

The project is in beta: the task set is still small (~100 tasks) and actively expanding, so coverage is currently limited.

Health Check

  • Last Commit: 1 day ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 18
  • Issues (30d): 5
  • Star History: 145 stars in the last 30 days
