harbor by laude-institute

Harness for AI agent evaluation and RL optimization

Created 5 months ago
322 stars

Top 84.5% on SourcePulse

Project Summary

Harbor is a framework designed for evaluating and optimizing AI agents and language models, enabling users to create and utilize Reinforcement Learning (RL) environments. It serves as the official harness for benchmarks like Terminal-Bench 2.0, targeting researchers and engineers who need to streamline agent performance assessment and experimentation. Harbor simplifies the process of benchmarking, parallel execution, and RL rollout generation.

How It Works

The framework acts as a unified interface for running agent evaluations against diverse benchmarks. It supports arbitrary agents such as Claude Code, OpenHands, and Codex CLI. Harbor's architecture facilitates building and sharing custom benchmarks and environments, and crucially, enables massively parallel experimentation through integrations with cloud providers like Daytona and Modal. This parallelization is key for efficient RL optimization and large-scale testing.
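The parallel-experimentation idea can be illustrated with a minimal sketch of bounded concurrency. This is not Harbor's implementation or API: `run_trial` and `run_all` are hypothetical names, the trial body is a placeholder, and the default cap of 100 only mirrors the concurrency figure quoted in this summary for Daytona-backed runs.

```python
import asyncio


async def run_trial(task_id):
    """Hypothetical stand-in for one agent trial.

    A real trial would start a sandbox, run the agent, and score the result;
    here we only yield control once and return a placeholder record.
    """
    await asyncio.sleep(0)
    return {"task": task_id, "status": "done"}


async def run_all(task_ids, max_concurrent=100):
    """Run every trial with at most `max_concurrent` in flight at once.

    asyncio.gather returns results in the same order as the inputs.
    """
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(task_id):
        # The semaphore blocks new trials once the cap is reached.
        async with semaphore:
            return await run_trial(task_id)

    return await asyncio.gather(*(guarded(t) for t in task_ids))


results = asyncio.run(run_all(["task-a", "task-b", "task-c"], max_concurrent=2))
print([r["task"] for r in results])
```

A semaphore keeps the number of simultaneous sandboxes under a fixed limit, which is the usual way to respect a provider's quota while still saturating it.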

Quick Start & Requirements

Installation is straightforward via uv tool install harbor or pip install harbor. Running benchmarks typically requires setting environment variables for API keys (e.g., ANTHROPIC_API_KEY, DAYTONA_API_KEY). Local execution utilizes Docker, while cloud providers like Daytona can be leveraged for scaled, parallel runs by specifying the --env flag. Use harbor run --help for command options and harbor datasets list to explore available benchmarks.
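As a rough illustration of the setup described above, the following sketch checks that the API keys are present before anything is launched and lists the two discovery commands named in this summary. The helper names (`missing_keys`, `discovery_commands`) are hypothetical and not part of Harbor; which keys are actually required depends on the agent and environment chosen, and no `harbor run` flags beyond those quoted above are assumed.

```python
import os
import shutil

# Keys named in the summary as examples; the real requirement depends on
# the agent and environment you select.
EXAMPLE_KEYS = ("ANTHROPIC_API_KEY", "DAYTONA_API_KEY")


def missing_keys(required, env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


def discovery_commands():
    """Argument lists for the two discovery commands quoted in the summary."""
    return [["harbor", "run", "--help"], ["harbor", "datasets", "list"]]


# Report what is missing; nothing is executed here.
if shutil.which("harbor") is None:
    print("harbor not found on PATH; install it with uv or pip first")
for name in missing_keys(EXAMPLE_KEYS):
    print(f"environment variable {name} is not set")
for argv in discovery_commands():
    print("next step:", " ".join(argv))
```

Failing fast on a missing key is cheaper than discovering it after a batch of cloud sandboxes has already started.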

Highlighted Details

  • Official harness for Terminal-Bench 2.0.
  • Evaluates a wide range of agents including Claude Code, OpenHands, and Codex CLI.
  • Supports parallel execution of up to 100 concurrent experiments via cloud providers like Daytona.
  • Integrates with third-party benchmarks such as SWE-Bench and Aider Polyglot.
  • Facilitates the generation of rollouts for RL optimization tasks.

Maintenance & Community

The provided README does not contain specific details regarding maintainers, community channels (like Discord or Slack), or a public roadmap.

Licensing & Compatibility

The README does not explicitly state the project's license type or provide compatibility notes for commercial use.

Limitations & Caveats

The README does not detail any specific limitations, known bugs, alpha status, or unsupported platforms for the Harbor framework.

Health Check

Last Commit: 15 hours ago
Responsiveness: Inactive
Pull Requests (30d): 232
Issues (30d): 27

Star History

145 stars in the last 30 days

Explore Similar Projects

Starred by Will Brown (Research Lead at Prime Intellect) and Eric Zhu (Coauthor of AutoGen; Research Scientist at Microsoft Research).

hud-python by hud-evals

3.3%
257
AI agent development and evaluation toolkit
Created 10 months ago
Updated 20 hours ago