rdi-berkeley
Evaluating AI agents on complex, real-world tasks
Top 46.1% on SourcePulse
Summary
Agents' Last Exam (ALE) is a benchmark for evaluating AI agents on economically valuable, long-horizon tasks across non-physical industries. It targets frontier agent systems and offers a standardized framework with verifiable outcomes and real-world operating environments, so researchers and developers can measure performance objectively.
How It Works
ALE evaluates advanced agent systems, each comprising a foundation model, an action loop, tools, and memory, and gives the agent only a task description. Agents operate in full-OS sandboxes (Windows or Linux) equipped with professional software and data, mimicking production contexts. The ale_run toolkit provisions these environments, executes agents, and grades their output against hidden, deterministic references. The benchmark specifically assesses "Generalist CUA-agents", which can act through both the CLI and the GUI, via a unified CUA MCP bridge. Each run is fully recorded as a uniform trajectory with logs and artifacts, so it can be audited and replayed.
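The provision, execute, and grade flow described above can be illustrated with a toy sketch. Every name here (RunRecord, grade, run_task, demo_agent) is hypothetical and invented for illustration; none of it is the ale_run API, and the "sandbox" is only a log line standing in for a real OS image. It assumes grading means an exact match against a reference the agent never sees.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical types for illustration only; not part of ale_run.
Step = Tuple[str, Dict[str, str]]      # (action description, artifacts produced)
Agent = Callable[[str], Iterable[Step]]


@dataclass
class RunRecord:
    """Uniform record of one run: trajectory, logs, artifacts, and verdict."""
    task: str
    trajectory: List[str] = field(default_factory=list)
    logs: List[str] = field(default_factory=list)
    artifacts: Dict[str, str] = field(default_factory=dict)
    passed: bool = False


def grade(artifacts: Dict[str, str], reference: Dict[str, str]) -> bool:
    """Deterministic check: each reference artifact must match exactly."""
    return all(artifacts.get(name) == want for name, want in reference.items())


def run_task(task: str, agent: Agent, reference: Dict[str, str]) -> RunRecord:
    record = RunRecord(task=task)
    record.logs.append("sandbox provisioned")   # stand-in for provisioning
    # The agent receives only the task description, never the reference.
    for action, produced in agent(task):
        record.trajectory.append(action)
        record.artifacts.update(produced)
    record.logs.append("agent finished")
    record.passed = grade(record.artifacts, reference)
    record.logs.append("graded")
    return record


def demo_agent(task: str) -> Iterable[Step]:
    """Toy agent that mixes a CLI-style step and a GUI-style step."""
    yield "cli: write report.txt", {"report.txt": "total=42"}
    yield "gui: save and close editor", {}


record = run_task(
    "Compute the total and save it to report.txt",
    demo_agent,
    reference={"report.txt": "total=42"},   # hidden from the agent
)
print(record.passed, len(record.trajectory))  # prints: True 2
```

Because the record keeps every action in order alongside the logs and artifacts, the same structure is enough to audit a verdict or replay the steps later.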
Quick Start & Requirements
After a one-time Google Cloud setup (about 10 minutes, covered by a $300 free trial), a single command starts a sandbox, runs a demo agent on a basic task, and grades the result. Setup involves creating a project, copying the sandbox image, and configuring API keys. Detailed end-to-end guidance is available in docs/quickstart.md. Relevant links include the official website (https://agents-last-exam.org/), arXiv paper (https://arxiv.org/abs/2606.05405), Hugging Face (https://huggingface.co/agents-last-exam), and the leaderboard (https://agenthle.org/leaderboard).
Maintenance & Community
The project is led by UC Berkeley RDI and the RDI Foundation, with contributions from over 300 industry experts. Updates and news can be followed via a mailing list (https://groups.google.com/g/agenthle-news). Direct contact is available at rdi_research@berkeley.edu.
Licensing & Compatibility
The core framework software (e.g., ale_run, documentation) is licensed under Apache-2.0, permitting commercial use and modification. Benchmark data, including tasks and sample runs, is licensed under CC BY 4.0, requiring attribution.
Limitations & Caveats
Initial setup requires a Google Cloud project and incurs the associated cloud costs, although a $300 free trial covers the setup. The benchmark currently focuses exclusively on non-physical industries. Because the task corpus is extensive, only a subset is publicly available. The framework is designed for evaluating sophisticated "frontier" agent systems.