agents-last-exam  by rdi-berkeley

Evaluating AI agents on complex, real-world tasks

Created 1 month ago
743 stars

Top 46.1% on SourcePulse

Summary

Agents' Last Exam (ALE) provides a comprehensive benchmark for evaluating AI agents on economically valuable, long-horizon tasks across non-physical industries. It targets frontier agent systems, offering a standardized framework with verifiable outcomes and real-world operating environments, enabling objective performance measurement for researchers and developers.

How It Works

ALE gives an agent system (a foundation model together with its action loop, tools, and memory) nothing but a task description. Agents operate within realistic, full OS sandboxes (Windows or Linux) equipped with professional software and data, mimicking production contexts. The ale_run toolkit provisions these environments, executes agents, and scores their output against hidden references using deterministic graders. The benchmark specifically targets "Generalist CUA-agents" that can work through both CLI and GUI interactions, using a unified CUA MCP bridge. Each run is fully recorded as a uniform trajectory with logs and artifacts, so it can be audited and replayed.
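The loop described above can be sketched in a few lines of Python. This is only an illustration of the idea (a policy that sees just the task description, picks tools, and has every step recorded), not ale_run's actual interface: `Step`, `Trajectory`, `run_agent`, `toy_policy`, and `toy_tools` are all hypothetical names, and the toy policy stands in for a foundation model.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Step:
    """One recorded action and what the environment returned (hypothetical schema)."""
    index: int
    tool: str
    argument: str
    observation: str


@dataclass
class Trajectory:
    """Uniform record of a run: the task plus every step taken."""
    task: str
    steps: list[Step] = field(default_factory=list)


Policy = Callable[[str, list[Step]], Optional[tuple[str, str]]]


def run_agent(task: str, policy: Policy,
              tools: dict[str, Callable[[str], str]],
              max_steps: int = 10) -> Trajectory:
    """Action loop: ask the policy for a tool call, execute it, record it."""
    trajectory = Trajectory(task=task)
    for i in range(max_steps):
        decision = policy(task, trajectory.steps)
        if decision is None:  # the policy signals that it is finished
            break
        tool, argument = decision
        observation = tools[tool](argument)
        trajectory.steps.append(Step(i, tool, argument, observation))
    return trajectory


def toy_policy(task: str, history: list[Step]) -> Optional[tuple[str, str]]:
    """Stand-in for a foundation model: one CLI action, one GUI action, then stop."""
    if not history:
        return ("shell", "ls")
    if len(history) == 1:
        return ("gui", "click Save")
    return None


# Assumed tool set: one CLI-style tool and one GUI-style tool.
toy_tools = {
    "shell": lambda cmd: f"ran {cmd}",
    "gui": lambda action: f"performed {action}",
}

trajectory = run_agent("Export the quarterly report", toy_policy, toy_tools)
for step in trajectory.steps:
    print(step.index, step.tool, step.argument, "->", step.observation)
```

Because every step is appended to one structure, the same record can serve for auditing and for replay, which is the property the summary attributes to ALE's run recordings.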

Quick Start & Requirements

After a one-time Google Cloud setup (~10 min, covered by a $300 free trial), a single command starts a sandbox, runs a demo agent on a basic task, and grades the result. The setup consists of creating a project, copying the sandbox image, and configuring API keys. End-to-end guidance is in docs/quickstart.md. Relevant links: official website (https://agents-last-exam.org/), arXiv paper (https://arxiv.org/abs/2606.05405), Hugging Face (https://huggingface.co/agents-last-exam), and leaderboard (https://agenthle.org/leaderboard).

Highlighted Details

  • Broad Industry Coverage: Benchmarks agents across 55 industries using 150 public reference tasks (from a 1,500+ task corpus).
  • Verifiable Outcomes: Employs hidden references and deterministic graders for objective scoring.
  • Long-Horizon Tasks: Focuses on multi-step workflows executed within realistic operating system sandboxes.
  • Economically Valuable Tasks: Tasks are sourced and validated by industry experts to reflect real-world value.
  • CUA Agent Evaluation: Assesses agents combining Command-Line Interface (CLI) and Graphical User Interface (GUI) capabilities.
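To make "hidden references and deterministic graders" concrete, here is a minimal sketch of one way such a check could work. It is not ALE's grader, whose internals this summary does not describe: the function names, the JSON artifact format, and the sample values are assumptions made for illustration. The grader holds only a digest of the reference, so the reference content is never exposed, and the same submission always receives the same verdict.

```python
import hashlib
import json


def canonical_digest(artifact: dict) -> str:
    """Hash a JSON-serializable artifact in a key-order-independent way."""
    payload = json.dumps(artifact, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def grade(submission: dict, reference_digest: str) -> bool:
    """Deterministic pass/fail: the submission must match the hidden reference."""
    return canonical_digest(submission) == reference_digest


# The reference stays with the benchmark maintainers; only its digest is needed.
hidden_reference = {"total": 1250, "currency": "USD"}  # made-up sample data
reference_digest = canonical_digest(hidden_reference)

# Same content in a different key order still passes.
passed = grade({"currency": "USD", "total": 1250}, reference_digest)
# A wrong value fails.
failed = grade({"currency": "USD", "total": 1200}, reference_digest)
print(passed, failed)
```

Exact matching is the simplest deterministic rule. A real grader for long-horizon tasks would likely need task-specific checks, such as tolerances or structural comparison, which this sketch does not attempt.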

Maintenance & Community

The project is led by UC Berkeley RDI and the RDI Foundation, with contributions from over 300 industry experts. Updates and news can be followed via a mailing list (https://groups.google.com/g/agenthle-news). Direct contact is available at rdi_research@berkeley.edu.

Licensing & Compatibility

The core framework software (e.g., ale_run, documentation) is licensed under Apache-2.0, permitting commercial use and modification. Benchmark data, including tasks and sample runs, is licensed under CC BY 4.0, requiring attribution.

Limitations & Caveats

Initial setup requires Google Cloud and incurs the associated costs, though a $300 free trial covers the setup. The benchmark currently covers only non-physical industries. Only 150 of the 1,500+ tasks in the corpus are publicly available. The framework is designed for evaluating sophisticated "frontier" agent systems.

Health Check

Last Commit: 1 day ago
Responsiveness: Inactive
Pull Requests (30d): 20
Issues (30d): 5
Star History: 745 stars in the last 30 days
