hal-harness by princeton-pli

Standardized harness for reproducible AI agent evaluations

Created 1 year ago
253 stars

Top 99.3% on SourcePulse

Project Summary

Holistic Agent Leaderboard (HAL) provides a standardized evaluation harness for reproducible AI agent benchmarking. It addresses the critical need for cost-aware evaluations and meaningful comparisons across diverse benchmarks, targeting researchers and engineers developing AI agents. HAL enables users to run agents consistently, track costs, and share results, fostering transparency and accelerating agent development.

How It Works

The core of HAL is a unified hal-eval CLI that abstracts away benchmark-specific execution details, allowing agents to be tested across supported platforms like SWE-bench, USACO, and AppWorld. It supports flexible execution environments, including isolated local setups via Conda or Docker, and scalable cloud deployments on Azure VMs. Integrations with Weave for detailed cost and usage metrics, and HuggingFace for secure agent trace uploads, ensure comprehensive monitoring and result sharing. This approach prioritizes agent implementation flexibility and reproducible, cost-conscious evaluation.
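As a concrete illustration of the unified CLI described above, a run might look like the following sketch. The flag names, agent layout, and model identifier are assumptions, not an interface confirmed by this summary; consult the repository README for the actual options.

```shell
# Hypothetical hal-eval invocation (flag names and paths are assumptions
# based on the unified-CLI description above, not a confirmed interface).
hal-eval --benchmark usaco \
  --agent_dir agents/my_usaco_agent/ \
  --agent_function main.run \
  --agent_name "my-agent (gpt-4o)" \
  -A model_name=gpt-4o
```

Because the harness abstracts benchmark-specific execution, the same agent directory could in principle be pointed at a different `--benchmark` value without code changes.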

Quick Start & Requirements

Setup involves cloning the repository (git clone --recursive), creating a Conda environment (conda create -n hal python=3.12), activating it (conda activate hal), and installing the package (pip install -e .). Users must configure API keys (HuggingFace, Weave, LLMs) in a .env file and install model provider SDKs (e.g., pip install openai). Optional Docker or Azure VM support requires additional setup and dependencies.
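The steps above can be collected into a single setup sequence. The clone URL and the `.env` variable names are assumptions inferred from this summary; verify them against the repository README before use.

```shell
# Setup sketch based on the steps above. The clone URL and the .env
# variable names are assumptions; check the repository for specifics.
git clone --recursive https://github.com/princeton-pli/hal-harness.git
cd hal-harness

conda create -n hal python=3.12 -y
conda activate hal
pip install -e .

# Install the SDKs for the model providers you plan to evaluate with
pip install openai

# API keys (HuggingFace, Weave, LLM providers) go in a .env file;
# the exact variable names below are placeholders, not confirmed names.
cat > .env <<'EOF'
HF_TOKEN=...
WANDB_API_KEY=...
OPENAI_API_KEY=...
EOF
```

The `--recursive` flag matters here: benchmark harnesses are typically vendored as git submodules, so a plain clone would leave them empty.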

Highlighted Details

  • Supports a wide array of benchmarks including SWE-bench Verified, USACO, AppWorld, CORE-bench, tau-bench, SciCode, AssistantBench, ScienceAgentBench, and CollaborativeAgentBench.
  • Features a unified CLI (hal-eval) for cross-benchmark agent evaluation and parallelized execution capabilities.
  • Integrates Weave for detailed cost tracking and usage metrics, alongside automatic encryption and upload of agent traces to HuggingFace Hub.
  • Enables local, Docker-containerized, or Azure VM-based evaluation environments for flexibility and isolation.

Maintenance & Community

The project is associated with Princeton University's PLI (Princeton Language and Intelligence) group, as indicated by the citation. Specific community channels (e.g., Discord, Slack) or a public roadmap are not detailed in the provided README.

Licensing & Compatibility

The README does not explicitly state a software license. Given the academic affiliation and citation format, users should verify licensing terms for commercial or closed-source integration.

Limitations & Caveats

The SWE-bench Verified (Mini) benchmark is explicitly noted as incompatible with arm64 architectures (e.g., Mac M chips). Several benchmarks require manual dataset downloads, specific dependency installations, or decryption of sensitive files (e.g., CORE-bench GPG).

Health Check

  • Last Commit: 4 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 16
  • Issues (30d): 3
  • Star History: 17 stars in the last 30 days
