terminal-bench-rl by Danau5tin

Train long-horizon terminal agents with scalable RL

Created 2 months ago
271 stars

Top 94.9% on SourcePulse

View on GitHub
1 Expert Loves This Project
Project Summary

Terminal-Bench-RL: Scalable Reinforcement Learning for Terminal Agents

This project addresses the heavy computational and infrastructure requirements of training long-horizon, terminal-based coding agents. It provides a robust, scalable reinforcement learning (RL) training framework capable of scaling to 32x H100 GPUs, enabling the development of advanced agents for complex terminal and coding tasks. The primary benefit is ready-to-use training infrastructure plus a high-performing agent baseline that has demonstrated strong results on the Stanford TerminalBench leaderboard.

How It Works

The project extends the rLLM framework with custom environments (DockerIsolatedEnv) and agents (TerminalBenchAgent) tailored for terminal interaction. Training uses Group Relative Policy Optimization (GRPO). Agents communicate through a structured XML/YAML format that defines actions such as bash commands, file operations, and task management, supporting type safety and error recovery. Reward signals come from a dual system: 65% from Python unit tests verifying task completion and 35% from an LLM-as-a-Judge evaluating agent behavior, planning, and tool usage.
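
To make the reward scheme concrete, the sketch below shows how the two signals could be blended with the 65/35 weighting described above and then turned into group-relative advantages in the GRPO style. The weighting comes from the project description; the function names, score ranges, and normalization details are illustrative assumptions, not the repository's actual code.

```python
from statistics import mean, stdev

# Weighting taken from the project description: 65% unit tests, 35% LLM judge.
TEST_WEIGHT = 0.65
JUDGE_WEIGHT = 0.35

def combined_reward(test_pass_rate: float, judge_score: float) -> float:
    """Blend the two reward signals; both are assumed to lie in [0, 1]."""
    return TEST_WEIGHT * test_pass_rate + JUDGE_WEIGHT * judge_score

def grpo_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Group-relative advantages in the GRPO style: each rollout's reward is
    normalized against the mean/std of its group of rollouts for the same task."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: four rollouts of one task with hypothetical test/judge scores.
group = [combined_reward(t, j) for t, j in [(1.0, 0.9), (0.5, 0.7), (0.0, 0.4), (1.0, 0.6)]]
print(grpo_advantages(group))
```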

Quick Start & Requirements

Clone the repository with submodules (git clone --recurse-submodules) and install dependencies with uv sync. Key requirements are Python 3.12 (due to a fork of terminal-bench) and Docker for the isolated execution environments. Evaluation is straightforward to set up, but full-scale training requires substantial GPU resources (tested up to 32x H100s), with estimated compute costs of £30k-£50k. Links to single-node and multi-node training guides are provided in the documentation.
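
A minimal setup sketch following the steps above. The repository URL is left as a placeholder, and the directory name and the Docker check are assumptions; only the clone and uv sync commands are taken from the description.

```bash
# Clone with submodules (the project depends on a fork of terminal-bench)
git clone --recurse-submodules <repo-url>
cd terminal-bench-rl   # assumed directory name

# Install dependencies (Python 3.12 required)
uv sync

# Docker must be available for the isolated execution environments
docker info
```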

Highlighted Details

  • Scalability: Successfully tested training infrastructure across 4 bare-metal nodes using 32x H100 GPUs.
  • Leaderboard Performance: The Qwen3-32B agent achieved top scores among Qwen3 agents on the Stanford TerminalBench leaderboard, outperforming GPT-4.1 (Codex) and DeepSeek R1 agents.
  • Action-Based Architecture: Implements a robust toolset (e.g., file operations, search, bash execution, todo management) with structured XML/YAML communication for reliable agent interaction; a hypothetical message is sketched below.
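
The exact message schema is not reproduced in this summary, so the example below is a hypothetical illustration of XML-tagged actions and the kind of error recovery such a format allows; the tag and field names are invented for the sketch and are not the project's actual protocol.

```python
import xml.etree.ElementTree as ET

# Hypothetical action message; the real tags used by TerminalBenchAgent may differ.
message = """
<action>
  <bash>
    <command>ls -la /workspace</command>
    <timeout_seconds>30</timeout_seconds>
  </bash>
</action>
"""

def parse_action(raw: str) -> dict:
    """Parse one structured action, returning an error payload instead of crashing."""
    try:
        root = ET.fromstring(raw.strip())
        tool = root[0]  # first child names the tool, e.g. <bash>
        fields = {child.tag: child.text for child in tool}
        return {"tool": tool.tag, "fields": fields}
    except (ET.ParseError, IndexError) as exc:
        # Malformed output can be fed back to the model as an error observation.
        return {"tool": "error", "fields": {"message": str(exc)}}

print(parse_action(message))
# -> {'tool': 'bash', 'fields': {'command': 'ls -la /workspace', 'timeout_seconds': '30'}}
```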

Maintenance & Community

The project appears to be primarily maintained by the author, Danau5tin. Future improvements are outlined, including implementing curriculum learning, expanding the dataset, and refining data filtering. No specific community channels (like Discord or Slack) or external partnerships are mentioned.

Licensing & Compatibility

The specific open-source license for this repository is not explicitly stated in the provided README content.

Limitations & Caveats

A significant limitation is the lack of a full RL training run due to prohibitive compute costs (£30k-£50k), meaning the agent's performance could be substantially improved with adequate resources. The optimal LLM judge (Claude Sonnet 4) is also expensive for extensive use. The current dataset comprises approximately 331 tasks, and expansion is recommended for more comprehensive training.

Health Check

  • Last Commit: 1 month ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 28 stars in the last 30 days
