AgentBench by THUDM

Benchmark for evaluating LLMs as agents across diverse environments

created 2 years ago
2,713 stars

Top 17.9% on sourcepulse

View on GitHub
Project Summary

AgentBench provides a comprehensive benchmark for evaluating Large Language Models (LLMs) as autonomous agents across diverse simulated environments. It targets researchers and developers aiming to assess and improve LLM agent capabilities in tasks ranging from operating systems and databases to web browsing and games. The benchmark offers a standardized framework and a leaderboard to track progress in this rapidly evolving field.

How It Works

AgentBench employs a modular framework that simulates various real-world scenarios as distinct tasks. LLMs interact with these environments by generating actions (e.g., commands, queries, tool calls) which are then executed by the environment. The framework orchestrates this interaction, managing the LLM's state, environment feedback, and task progression. This approach allows for systematic evaluation of an LLM's ability to plan, reason, and execute multi-step actions in complex, interactive settings.
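
The loop below is a minimal illustrative sketch of that orchestration in Python, not AgentBench's actual API: EchoEnvironment and run_episode are hypothetical names, and a real run would use the repository's task workers and a real model call in place of the dummy LLM.

    from dataclasses import dataclass

    @dataclass
    class StepResult:
        observation: str       # environment feedback shown to the model
        done: bool             # whether the task has terminated
        reward: float = 0.0    # task score, e.g. a success indicator

    class EchoEnvironment:
        """Toy environment that simply ends after a fixed number of steps."""
        def __init__(self, max_steps: int = 3):
            self.max_steps = max_steps
            self.steps = 0

        def reset(self) -> str:
            self.steps = 0
            return "You are in a shell. Issue commands to complete the task."

        def step(self, action: str) -> StepResult:
            self.steps += 1
            done = self.steps >= self.max_steps
            return StepResult(observation=f"ran: {action}", done=done,
                              reward=1.0 if done else 0.0)

    def run_episode(llm, env, max_turns: int = 10) -> float:
        """One multi-turn episode: prompt -> model action -> environment feedback."""
        history = [{"role": "user", "content": env.reset()}]
        for _ in range(max_turns):
            action = llm(history)                      # model proposes an action
            history.append({"role": "assistant", "content": action})
            result = env.step(action)                  # environment executes it
            history.append({"role": "user", "content": result.observation})
            if result.done:
                return result.reward
        return 0.0

    if __name__ == "__main__":
        def dummy_llm(history):                        # stand-in for a real model call
            return "ls -la"

        print(run_episode(dummy_llm, EchoEnvironment()))  # prints 1.0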

Quick Start & Requirements

  • Install: Clone the repository and install dependencies using pip install -r requirements.txt within a conda environment (Python 3.9 recommended).
  • Prerequisites: Docker is required for several tasks, and an OpenAI API key is needed for the default agent configuration. Task-specific Docker images (e.g., for webshop and mind2web) must be pulled or built (a preflight sketch follows this list).
  • Setup: Initial setup involves cloning, environment creation, dependency installation, and Docker image setup; once the controller is started, task workers take approximately 1 minute to initialize.
  • Links: Website, Paper, Slack
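
The following is a minimal preflight sketch for those prerequisites, assuming the default agent reads the standard OPENAI_API_KEY environment variable (an assumption here; the exact settings live in the repository's config files):

    # Illustrative preflight check for the prerequisites listed above.
    # OPENAI_API_KEY is the usual OpenAI convention and an assumption here;
    # consult the repository's configs for the exact settings AgentBench reads.
    import os
    import shutil
    import sys

    def check_prerequisites() -> bool:
        ok = True
        if sys.version_info[:2] != (3, 9):
            print(f"warning: Python 3.9 is recommended, found {sys.version.split()[0]}")
        if shutil.which("docker") is None:
            print("error: Docker not found; several tasks need Docker images")
            ok = False
        if not os.environ.get("OPENAI_API_KEY"):
            print("error: OPENAI_API_KEY is not set (used by the default agent config)")
            ok = False
        return ok

    if __name__ == "__main__":
        sys.exit(0 if check_prerequisites() else 1)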

Highlighted Details

  • Evaluates LLMs across 8 distinct environments: 5 novel domains (OS, DB, KG, DCG, LTP) and 3 recompiled from existing datasets (HH, WS, WB).
  • Introduces VisualAgentBench for evaluating visual foundation agents with 5 new environments and support for 17 LMMs.
  • Provides full data splits for Dev and Test sets, with multi-turn interactions requiring thousands of LLM calls per dataset.
  • Offers a public leaderboard to compare performance across various LLMs.

Maintenance & Community

  • Developed by THUDM.
  • Active community engagement via Slack for Q&A and collaboration.
  • Recent updates include VisualAgentBench and AgentBench v0.2.

Licensing & Compatibility

  • The repository's license is not stated in this summary; THUDM projects commonly use permissive licenses (e.g., MIT or Apache-2.0), so check the LICENSE file on GitHub before relying on it.
  • Permissive licenses are generally compatible with commercial use, but individual task environments and underlying datasets may carry their own licensing terms.

Limitations & Caveats

  • Some tasks require significant memory (e.g., webshop ~15GB).
  • The KnowledgeGraph task depends on an external SPARQL service; instructions for local deployment are provided (a connectivity-check sketch follows this list).
  • Performance varies significantly across LLMs and tasks, and the benchmark highlights notable gaps between current models and practical usability.
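
For the KnowledgeGraph dependency, a small connectivity check such as the one below can confirm that a locally deployed SPARQL service is reachable before launching the task. The localhost:8890 endpoint is an assumed Virtuoso default, not a value taken from the repository; substitute whatever address your deployment exposes.

    # Illustrative connectivity check for a locally deployed SPARQL endpoint.
    # The endpoint URL below is an assumption (a common Virtuoso default), not
    # a value taken from AgentBench; adjust it to match your local deployment.
    import json
    import urllib.parse
    import urllib.request

    ENDPOINT = "http://localhost:8890/sparql"

    def ping_sparql(endpoint: str = ENDPOINT) -> bool:
        query = "SELECT ?s WHERE { ?s ?p ?o } LIMIT 1"
        url = endpoint + "?" + urllib.parse.urlencode({"query": query})
        req = urllib.request.Request(
            url, headers={"Accept": "application/sparql-results+json"})
        try:
            with urllib.request.urlopen(req, timeout=10) as resp:
                data = json.load(resp)
            return bool(data.get("results", {}).get("bindings"))
        except OSError as exc:
            print(f"SPARQL endpoint unreachable: {exc}")
            return False

    if __name__ == "__main__":
        print("reachable" if ping_sparql() else "not reachable")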

Health Check

  • Last commit: 6 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

  • 196 stars in the last 90 days
