hal-harness by princeton-pli

Standardized harness for reproducible AI agent evaluations

Created 1 year ago
253 stars

Top 99.3% on SourcePulse

Project Summary

Holistic Agent Leaderboard (HAL) provides a standardized evaluation harness for reproducible AI agent benchmarking. It addresses the critical need for cost-aware evaluations and meaningful comparisons across diverse benchmarks, targeting researchers and engineers developing AI agents. HAL enables users to run agents consistently, track costs, and share results, fostering transparency and accelerating agent development.

How It Works

The core of HAL is a unified hal-eval CLI that abstracts away benchmark-specific execution details, allowing agents to be tested across supported platforms like SWE-bench, USACO, and AppWorld. It supports flexible execution environments, including isolated local setups via Conda or Docker, and scalable cloud deployments on Azure VMs. Integrations with Weave for detailed cost and usage metrics, and HuggingFace for secure agent trace uploads, ensure comprehensive monitoring and result sharing. This approach prioritizes agent implementation flexibility and reproducible, cost-conscious evaluation.
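As a concrete illustration of the unified CLI described above, a run might look like the following sketch. The flag names, agent layout, and model identifier are assumptions, not an interface confirmed by this summary; consult the repository README for the actual options.

```shell
# Hypothetical hal-eval invocation (flag names and paths are assumptions
# based on the unified-CLI description above, not a confirmed interface).
hal-eval --benchmark usaco \
  --agent_dir agents/my_usaco_agent/ \
  --agent_function main.run \
  --agent_name "my-agent (gpt-4o)" \
  -A model_name=gpt-4o
```

Because the harness abstracts benchmark-specific execution, the same agent directory could in principle be pointed at a different `--benchmark` value without code changes.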

Quick Start & Requirements

Setup involves cloning the repository (git clone --recursive), creating a Conda environment (conda create -n hal python=3.12), activating it (conda activate hal), and installing the package (pip install -e .). Users must configure API keys (HuggingFace, Weave, LLMs) in a .env file and install model provider SDKs (e.g., pip install openai). Optional Docker or Azure VM support requires additional setup and dependencies.
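The steps above can be collected into a single setup sequence. The clone URL and the `.env` variable names are assumptions inferred from this summary; verify them against the repository README before use.

```shell
# Setup sketch based on the steps above. The clone URL and the .env
# variable names are assumptions; check the repository for specifics.
git clone --recursive https://github.com/princeton-pli/hal-harness.git
cd hal-harness

conda create -n hal python=3.12 -y
conda activate hal
pip install -e .

# Install the SDKs for the model providers you plan to evaluate with
pip install openai

# API keys (HuggingFace, Weave, LLM providers) go in a .env file;
# the exact variable names below are placeholders, not confirmed names.
cat > .env <<'EOF'
HF_TOKEN=...
WANDB_API_KEY=...
OPENAI_API_KEY=...
EOF
```

The `--recursive` flag matters here: benchmark harnesses are typically vendored as git submodules, so a plain clone would leave them empty.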

Highlighted Details

  • Supports a wide array of benchmarks including SWE-bench Verified, USACO, AppWorld, CORE-bench, tau-bench, SciCode, AssistantBench, ScienceAgentBench, and CollaborativeAgentBench.
  • Features a unified CLI (hal-eval) for cross-benchmark agent evaluation and parallelized execution capabilities.
  • Integrates Weave for detailed cost tracking and usage metrics, alongside automatic encryption and upload of agent traces to HuggingFace Hub.
  • Enables local, Docker-containerized, or Azure VM-based evaluation environments for flexibility and isolation.

Maintenance & Community

The project is associated with Princeton University's PLI (Princeton Language and Intelligence) group, as indicated by the citation. Specific community channels (e.g., Discord, Slack) or a public roadmap are not detailed in the provided README.

Licensing & Compatibility

The README does not explicitly state a software license. Given the academic affiliation and citation format, users should verify licensing terms for commercial or closed-source integration.

Limitations & Caveats

The SWE-bench Verified (Mini) benchmark is explicitly noted as incompatible with arm64 architectures (e.g., Mac M chips). Several benchmarks require manual dataset downloads, specific dependency installations, or decryption of sensitive files (e.g., CORE-bench GPG).

Health Check

  • Last Commit: 4 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 16
  • Issues (30d): 3
  • Star History: 17 stars in the last 30 days
