laude-institute: Harness for AI agent evaluation and RL optimization
Harbor is a framework for evaluating and optimizing AI agents and language models, letting users create and use Reinforcement Learning (RL) environments. It is the official harness for benchmarks such as Terminal-Bench 2.0, targeting researchers and engineers who need to streamline agent performance assessment and experimentation. Harbor simplifies benchmarking, parallel execution, and RL rollout generation.
How It Works
The framework acts as a unified interface for running agent evaluations against diverse benchmarks, and it supports arbitrary agents such as Claude Code, OpenHands, and Codex CLI. Harbor's architecture facilitates building and sharing custom benchmarks and environments and, crucially, enables massively parallel experimentation through integrations with cloud providers such as Daytona and Modal. This parallelization is key to efficient RL optimization and large-scale testing.
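As a rough sketch, a parallel cloud run might look like the following. Only the harbor run command and the --env flag are confirmed by the Quick Start section below; the dataset, agent, and concurrency flags are illustrative assumptions, so consult harbor run --help for the actual option names.

```bash
# Hedged sketch: evaluate an agent with parallel rollouts on a cloud
# backend. The --dataset, --agent, and --n-concurrent flags below are
# illustrative placeholders, not documented options.
export DAYTONA_API_KEY="..."   # credentials for the Daytona backend

harbor run \
    --env daytona \
    --dataset terminal-bench-2.0 \
    --agent claude-code \
    --n-concurrent 32
```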
Quick Start & Requirements
Installation is straightforward via uv tool install harbor or pip install harbor. Running benchmarks typically requires setting environment variables for API keys (e.g., ANTHROPIC_API_KEY, DAYTONA_API_KEY). Local execution uses Docker, while cloud providers such as Daytona can run scaled, parallel jobs when selected with the --env flag. Use harbor run --help for command options and harbor datasets list to explore available benchmarks.
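Put together, a minimal local quick start uses only the commands named above; everything in this sketch comes directly from the README's own command list.

```bash
# Install the Harbor CLI (either installer works)
uv tool install harbor
# or: pip install harbor

# API key for the agent's model provider (example from the README)
export ANTHROPIC_API_KEY="..."

# List available benchmarks, then inspect run options
harbor datasets list
harbor run --help
```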
Highlighted Details
Maintenance & Community
The README does not name maintainers, community channels (such as Discord or Slack), or a public roadmap.
Licensing & Compatibility
The README does not state the project's license or offer compatibility notes for commercial use.
Limitations & Caveats
The README does not describe known limitations, bugs, alpha status, or unsupported platforms.