terminal-bench by laude-institute

Benchmark for LLM agents in real terminal environments

Created 1 year ago
1,413 stars

Top 28.6% on SourcePulse

Project Summary

Terminal-Bench is a benchmark for evaluating AI agents on complex, real-world terminal tasks. It targets developers who build LLM agents, work on benchmarking frameworks, or test system-level reasoning, and it offers a reproducible suite for practical, end-to-end performance assessment.

How It Works

The project comprises a dataset of tasks and an execution harness. Each task includes an English instruction, a verification script, and an oracle solution. The harness connects language models to a sandboxed terminal environment, enabling autonomous execution and evaluation of agent capabilities in text-based interactions.
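
The three task ingredients named above (instruction, verification script, oracle solution) map naturally onto a per-task directory. The layout below is an illustrative sketch only; the file names are assumptions, not confirmed by this summary, so check the repository for the actual convention:

    my-task/
    ├── task.yaml         # English instruction and task metadata (name assumed)
    ├── solution.sh       # oracle solution the harness can replay to sanity-check the task
    └── tests/
        └── run-tests.sh  # verification script that checks the terminal's end state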

Quick Start & Requirements
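
The summary page omits setup details. A minimal sketch, assuming the pip-installable terminal-bench package and its tb CLI as described in the project's GitHub README; Docker is needed for the sandboxed task environments, and the exact flags may differ from what is shown here:

    # Install the harness CLI (package and command names assumed from the
    # repository's README -- verify there before relying on them)
    pip install terminal-bench

    # Run a single task with the bundled oracle agent; flag and task names
    # are illustrative -- check `tb run --help` for the actual interface
    tb run --agent oracle --task-id hello-world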

Highlighted Details

  • Beta release with ~100 tasks; more are planned.
  • Leaderboard available for submitting agent evaluations.
  • Supports custom task creation and contribution (see the sketch after this list).
  • Includes a BibTeX citation for academic use.
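
For the custom-task bullet above, a hedged sketch of the authoring workflow. The tb tasks create scaffolding command is an assumption extrapolated from the CLI pattern in the quick start; the authoritative steps live in the repository's contribution docs:

    # Hypothetical task-authoring workflow -- command names assumed,
    # not confirmed by this summary
    tb tasks create                            # scaffold instruction, solution, and tests
    tb run --agent oracle --task-id my-task    # confirm the oracle solution passes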

Maintenance & Community

Activity metrics are summarized under Health Check below: the last commit was 5 days ago, with 19 pull requests and 4 issues opened in the last 30 days.

Licensing & Compatibility

  • License: Apache 2.0.
  • Compatible with commercial and closed-source applications.

Limitations & Caveats

The project is in beta with roughly 100 tasks; the dataset is still expanding, so task coverage may change between releases.

Health Check

  • Last commit: 5 days ago
  • Responsiveness: Inactive
  • Pull requests (30d): 19
  • Issues (30d): 4
  • Star history: 163 stars in the last 30 days
