terminal-bench by laude-institute

Benchmark for LLM agents in real terminal environments

Created 7 months ago · 393 stars · Top 73.1% on SourcePulse

Project Summary

Terminal-Bench provides a benchmark for evaluating AI agents on complex, real-world terminal tasks. It targets developers building LLM agents, benchmarking frameworks, or testing system-level reasoning, offering a reproducible suite for practical, end-to-end performance assessment.

How It Works

The project comprises a dataset of tasks and an execution harness. Each task pairs an English-language instruction with a verification script and an oracle (reference) solution. The harness connects a language model to a sandboxed terminal environment, lets the agent work autonomously through text-based interaction, and then evaluates the outcome with the task's verification script.
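
To make the task anatomy concrete, here is a minimal sketch of what a verification script could look like, assuming a pytest-style checker and a made-up log-filtering task; the file paths, test names, and checks are illustrative assumptions, not taken from the actual dataset.

```python
# Illustrative only: NOT a script from the Terminal-Bench dataset.
# Hypothetical task: "extract every ERROR line from /app/server.log
# into /app/errors.txt". The checker runs after the agent finishes and
# asserts on the resulting filesystem state, not the command transcript.
from pathlib import Path

OUTPUT = Path("/app/errors.txt")  # hypothetical path for this sketch

def test_output_file_exists():
    assert OUTPUT.is_file(), "agent never produced the requested file"

def test_output_contains_only_error_lines():
    lines = OUTPUT.read_text().splitlines()
    assert lines, "expected at least one extracted line"
    assert all("ERROR" in line for line in lines)
```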

Quick Start & Requirements
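
Per the upstream README, the harness ships as a Python package: `pip install terminal-bench` (or `uv tool install terminal-bench`) provides the `tb` command-line entry point, and Docker is required because each task runs inside an isolated container. Consult the repository README for exact invocations and the list of supported agents and models.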

Highlighted Details

  • Beta release with ~100 tasks; more are planned.
  • Leaderboard available for submitting agent evaluations.
  • Supports custom task creation and contribution (see the scaffolding sketch after this list).
  • Includes a BibTeX citation for academic use.
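
Since custom tasks are supported, a contributor's first step is laying out the task directory. The following is a hedged sketch of a scaffolding helper; the file names (`instruction.md`, `solution.sh`, `tests/test_outputs.py`) are assumptions inferred from the task anatomy described above, so check the repository's contributing guide for the authoritative layout.

```python
# Hedged sketch: scaffolds a new task directory using a layout inferred
# from the documented task anatomy (instruction + verification script +
# oracle solution). All file names here are assumptions.
from pathlib import Path

def scaffold_task(root: Path, name: str, instruction: str) -> Path:
    task_dir = root / name
    (task_dir / "tests").mkdir(parents=True, exist_ok=True)
    # English instruction shown to the agent.
    (task_dir / "instruction.md").write_text(instruction + "\n")
    # Oracle solution: a known-good command sequence that solves the task.
    (task_dir / "solution.sh").write_text("#!/bin/bash\n# TODO: oracle steps\n")
    # Verification script: asserts on the end state, not the transcript.
    (task_dir / "tests" / "test_outputs.py").write_text(
        "def test_placeholder():\n"
        "    assert True  # replace with real checks\n"
    )
    return task_dir

if __name__ == "__main__":
    scaffold_task(Path("tasks"), "hello-task",
                  "Create /app/hello.txt containing 'hi'.")
```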

Maintenance & Community

Licensing & Compatibility

  • License: Apache 2.0.
  • Compatible with commercial and closed-source applications.

Limitations & Caveats

The project is currently in beta: the task set (~100 tasks) is still expanding, so scores may not be directly comparable across dataset versions.

Health Check

  • Last commit: 21 hours ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 92
  • Issues (30d): 26
  • Star history: 145 stars in the last 30 days
