AgentBench by THUDM

Benchmark for evaluating LLMs as agents across diverse environments

created 2 years ago
2,713 stars

Top 17.9% on sourcepulse

View on GitHub
Project Summary

AgentBench provides a comprehensive benchmark for evaluating Large Language Models (LLMs) as autonomous agents across diverse simulated environments. It targets researchers and developers aiming to assess and improve LLM agent capabilities in tasks ranging from operating systems and databases to web browsing and games. The benchmark offers a standardized framework and a leaderboard to track progress in this rapidly evolving field.

How It Works

AgentBench employs a modular framework that simulates various real-world scenarios as distinct tasks. LLMs interact with these environments by generating actions (e.g., commands, queries, tool calls) which are then executed by the environment. The framework orchestrates this interaction, managing the LLM's state, environment feedback, and task progression. This approach allows for systematic evaluation of an LLM's ability to plan, reason, and execute multi-step actions in complex, interactive settings.
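
The loop below is a minimal illustrative sketch of that orchestration in Python, not AgentBench's actual API: EchoEnvironment and run_episode are hypothetical names, and a real run would use the repository's task workers and a real model call in place of the dummy LLM.

    from dataclasses import dataclass

    @dataclass
    class StepResult:
        observation: str       # environment feedback shown to the model
        done: bool             # whether the task has terminated
        reward: float = 0.0    # task score, e.g. a success indicator

    class EchoEnvironment:
        """Toy environment that simply ends after a fixed number of steps."""
        def __init__(self, max_steps: int = 3):
            self.max_steps = max_steps
            self.steps = 0

        def reset(self) -> str:
            self.steps = 0
            return "You are in a shell. Issue commands to complete the task."

        def step(self, action: str) -> StepResult:
            self.steps += 1
            done = self.steps >= self.max_steps
            return StepResult(observation=f"ran: {action}", done=done,
                              reward=1.0 if done else 0.0)

    def run_episode(llm, env, max_turns: int = 10) -> float:
        """One multi-turn episode: prompt -> model action -> environment feedback."""
        history = [{"role": "user", "content": env.reset()}]
        for _ in range(max_turns):
            action = llm(history)                      # model proposes an action
            history.append({"role": "assistant", "content": action})
            result = env.step(action)                  # environment executes it
            history.append({"role": "user", "content": result.observation})
            if result.done:
                return result.reward
        return 0.0

    if __name__ == "__main__":
        def dummy_llm(history):                        # stand-in for a real model call
            return "ls -la"

        print(run_episode(dummy_llm, EchoEnvironment()))  # prints 1.0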

Quick Start & Requirements

  • Install: Clone the repository and install dependencies using pip install -r requirements.txt within a conda environment (Python 3.9 recommended).
  • Prerequisites: Docker is required for several tasks, and an OpenAI API key is needed for the default agent configuration. Task-specific Docker images (e.g., for webshop and mind2web) must be pulled or built (a preflight sketch follows this list).
  • Setup: Initial setup involves cloning, environment creation, dependency installation, and Docker image setup; once the controller is started, task workers take approximately 1 minute to initialize.
  • Links: Website, Paper, Slack
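
The following is a minimal preflight sketch for those prerequisites, assuming the default agent reads the standard OPENAI_API_KEY environment variable (an assumption here; the exact settings live in the repository's config files):

    # Illustrative preflight check for the prerequisites listed above.
    # OPENAI_API_KEY is the usual OpenAI convention and an assumption here;
    # consult the repository's configs for the exact settings AgentBench reads.
    import os
    import shutil
    import sys

    def check_prerequisites() -> bool:
        ok = True
        if sys.version_info[:2] != (3, 9):
            print(f"warning: Python 3.9 is recommended, found {sys.version.split()[0]}")
        if shutil.which("docker") is None:
            print("error: Docker not found; several tasks need Docker images")
            ok = False
        if not os.environ.get("OPENAI_API_KEY"):
            print("error: OPENAI_API_KEY is not set (used by the default agent config)")
            ok = False
        return ok

    if __name__ == "__main__":
        sys.exit(0 if check_prerequisites() else 1)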

Highlighted Details

  • Evaluates LLMs across 8 distinct environments: 5 novel domains (OS, DB, KG, DCG, LTP) and 3 recompiled from existing datasets (HH, WS, WB).
  • Introduces VisualAgentBench for evaluating visual foundation agents with 5 new environments and support for 17 LMMs.
  • Provides full data splits for Dev and Test sets, with multi-turn interactions requiring thousands of LLM calls per dataset.
  • Offers a public leaderboard to compare performance across various LLMs.

Maintenance & Community

  • Developed by THUDM.
  • Active community engagement via Slack for Q&A and collaboration.
  • Recent updates include VisualAgentBench and AgentBench v0.2.

Licensing & Compatibility

  • The repository's license is not stated in this summary; THUDM projects commonly use permissive licenses (e.g., MIT or Apache-2.0), so check the LICENSE file on GitHub before relying on it.
  • Permissive licenses are generally compatible with commercial use, but individual task environments and underlying datasets may carry their own licensing terms.

Limitations & Caveats

  • Some tasks require significant memory (e.g., webshop ~15GB).
  • The KnowledgeGraph task depends on an external SPARQL service; instructions for local deployment are provided (a connectivity-check sketch follows this list).
  • Performance varies significantly across LLMs and tasks, and the benchmark highlights notable gaps between current models and practical usability.
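
For the KnowledgeGraph dependency, a small connectivity check such as the one below can confirm that a locally deployed SPARQL service is reachable before launching the task. The localhost:8890 endpoint is an assumed Virtuoso default, not a value taken from the repository; substitute whatever address your deployment exposes.

    # Illustrative connectivity check for a locally deployed SPARQL endpoint.
    # The endpoint URL below is an assumption (a common Virtuoso default), not
    # a value taken from AgentBench; adjust it to match your local deployment.
    import json
    import urllib.parse
    import urllib.request

    ENDPOINT = "http://localhost:8890/sparql"

    def ping_sparql(endpoint: str = ENDPOINT) -> bool:
        query = "SELECT ?s WHERE { ?s ?p ?o } LIMIT 1"
        url = endpoint + "?" + urllib.parse.urlencode({"query": query})
        req = urllib.request.Request(
            url, headers={"Accept": "application/sparql-results+json"})
        try:
            with urllib.request.urlopen(req, timeout=10) as resp:
                data = json.load(resp)
            return bool(data.get("results", {}).get("bindings"))
        except OSError as exc:
            print(f"SPARQL endpoint unreachable: {exc}")
            return False

    if __name__ == "__main__":
        print("reachable" if ping_sparql() else "not reachable")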

Health Check

  • Last commit: 6 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

  • 196 stars in the last 90 days
