laude-institute: Harness for AI agent evaluation and RL optimization
Harbor is a framework for evaluating and optimizing AI agents and language models, letting users create and use Reinforcement Learning (RL) environments. It is the official harness for benchmarks such as Terminal-Bench 2.0, targeting researchers and engineers who need to streamline agent performance assessment and experimentation. Harbor simplifies benchmarking, parallel execution, and RL rollout generation.
How It Works
The framework acts as a unified interface for running agent evaluations against diverse benchmarks, and it supports arbitrary agents such as Claude Code, OpenHands, and Codex CLI. Harbor's architecture facilitates building and sharing custom benchmarks and environments and, crucially, enables massively parallel experimentation through integrations with cloud providers such as Daytona and Modal. This parallelization is key to efficient RL optimization and large-scale testing.
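As a rough sketch, a parallel cloud run might look like the following. Only the harbor run command and the --env flag are confirmed by the Quick Start section below; the dataset, agent, and concurrency flags are illustrative assumptions, so consult harbor run --help for the actual option names.

```bash
# Hedged sketch: evaluate an agent with parallel rollouts on a cloud
# backend. The --dataset, --agent, and --n-concurrent flags below are
# illustrative placeholders, not documented options.
export DAYTONA_API_KEY="..."   # credentials for the Daytona backend

harbor run \
    --env daytona \
    --dataset terminal-bench-2.0 \
    --agent claude-code \
    --n-concurrent 32
```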
Quick Start & Requirements
Installation is straightforward via uv tool install harbor or pip install harbor. Running benchmarks typically requires setting environment variables for API keys (e.g., ANTHROPIC_API_KEY, DAYTONA_API_KEY). Local execution uses Docker, while cloud providers such as Daytona can run scaled, parallel jobs when selected with the --env flag. Use harbor run --help for command options and harbor datasets list to explore available benchmarks.
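Put together, a minimal local quick start uses only the commands named above; everything in this sketch comes directly from the README's own command list.

```bash
# Install the Harbor CLI (either installer works)
uv tool install harbor
# or: pip install harbor

# API key for the agent's model provider (example from the README)
export ANTHROPIC_API_KEY="..."

# List available benchmarks, then inspect run options
harbor datasets list
harbor run --help
```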
Highlighted Details
Maintenance & Community
The README does not name maintainers, community channels (such as Discord or Slack), or a public roadmap.
Licensing & Compatibility
The README does not state the project's license or offer compatibility notes for commercial use.
Limitations & Caveats
The README does not describe known limitations, bugs, alpha status, or unsupported platforms.