AgentBoard by hkust-nlp

Analytical evaluation board for multi-turn LLM agents

created 1 year ago
333 stars

Top 83.6% on sourcepulse

View on GitHub
Project Summary

AgentBoard provides a comprehensive analytical evaluation framework for multi-turn LLM agents across diverse environments. It targets researchers and developers aiming to systematically assess and compare the generalist capabilities of LLM agents, offering detailed insights into performance across various dimensions.

How It Works

AgentBoard employs four core principles: task diversity (9 tasks across Embodied AI, Game, Web, Tool), multi-round interaction, partially-observable environments, and analytical evaluation. It facilitates the construction of goal-oriented reflex agents and provides a Weights & Biases-integrated panel for visualizing fine-grained progress, grounding accuracy, and performance breakdowns. This approach enables a deeper understanding of agent behavior beyond simple success rates.
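
As a rough illustration of the analytical-evaluation idea, the sketch below shows a multi-turn loop that records a per-turn progress signal alongside the final success flag. The `env` and `agent` interfaces, the field names, and the `run_episode` helper are hypothetical stand-ins, not AgentBoard's actual API.

```python
# Minimal sketch, not AgentBoard's actual API: a multi-turn evaluation loop
# that records fine-grained progress at every step instead of only a final
# success flag. `env` and `agent` are hypothetical stand-ins for a
# partially-observable task environment and an LLM agent.
from dataclasses import dataclass, field
from typing import List


@dataclass
class TurnLog:
    turn: int
    action: str
    observation: str
    progress: float  # fraction of subgoals satisfied so far, in [0, 1]


@dataclass
class EpisodeResult:
    success: bool
    progress_rate: float                     # best progress reached in the episode
    trajectory: List[TurnLog] = field(default_factory=list)


def run_episode(env, agent, max_turns: int = 30) -> EpisodeResult:
    observation = env.reset()                # initial, partial observation
    trajectory, best_progress = [], 0.0
    for turn in range(max_turns):
        action = agent.act(observation)      # LLM proposes the next action
        observation, progress, done = env.step(action)
        best_progress = max(best_progress, progress)
        trajectory.append(TurnLog(turn, action, observation, progress))
        if done:
            break
    return EpisodeResult(success=best_progress >= 1.0,
                         progress_rate=best_progress,
                         trajectory=trajectory)
```

Tracking the best progress reached, rather than only the terminal state, is what lets the evaluation distinguish an agent that nearly completed a task from one that made no headway.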

Quick Start & Requirements

  • Installation: Local setup via setup.sh script (requires Python 3.8.13, conda). Docker image available.
  • Prerequisites: Internet access for certain tasks. For WebArena, dbus and Xvfb are required. API keys for proprietary models (OpenAI, Anthropic) and potentially tool-specific keys are needed. Weights & Biases API key for visualization.
  • Data: Download via wget https://huggingface.co/datasets/hkust-nlp/agentboard/resolve/main/data.tar.gz; a Python alternative is sketched after this list.
  • Resources: Local setup estimated at 15 minutes; Docker setup at 5 minutes (12GB). Evaluation runtime varies significantly by model and hardware (e.g., GPT-4: ~5.5h; DeepSeek-67b on 8xV100 with vLLM: ~18.5h).
  • Links: Website, Leaderboard, Paper, Data.
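
For reference, here is a Python alternative to the wget download above. It assumes the huggingface_hub package is installed and that the archive is published as data.tar.gz in the hkust-nlp/agentboard dataset repo, as in the URL shown; this is a sketch, not the project's official setup path.

```python
# A Python alternative to the wget command above. Assumes the huggingface_hub
# package is installed and that the archive is published as data.tar.gz in the
# hkust-nlp/agentboard dataset repo, as in the URL shown.
import tarfile
from huggingface_hub import hf_hub_download

archive_path = hf_hub_download(
    repo_id="hkust-nlp/agentboard",
    filename="data.tar.gz",
    repo_type="dataset",
)

# Unpack the archive into the current working directory.
with tarfile.open(archive_path, "r:gz") as tar:
    tar.extractall(path=".")
```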

Highlighted Details

  • Supports 12 state-of-the-art LLMs, including GPT-4, Claude2, Llama2, Mistral, and CodeLlama, with vLLM acceleration for open-source models.
  • Integrated Weights & Biases panel for detailed, multi-dimensional analysis and visualization of agent performance (see the logging sketch after this list).
  • Includes 9 diverse tasks: AlfWorld, ScienceWorld, BabyAI, Jericho, PDDL, WebShop, WebArena, Tool-Query, Tool-Operation.
  • Captures detailed trajectory logs, including screenshots and network traffic for WebArena.
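
Illustrative only: the snippet below shows how per-turn metrics such as progress and grounding could be streamed to a Weights & Biases run to produce breakdowns like those described above. The project, run, and metric names are hypothetical; this is not AgentBoard's own logging code.

```python
# Illustrative only, not AgentBoard's own logging code: streaming per-turn
# metrics to a Weights & Biases run so the panel can show fine-grained
# progress and grounding breakdowns. Project, run, and metric names are
# hypothetical.
import wandb


def log_episode(task, model, turns, success):
    """`turns` is a list of per-turn records such as
    {"progress": 0.5, "grounded": True}."""
    run = wandb.init(project="agentboard-eval", name=f"{model}-{task}")
    for i, t in enumerate(turns):
        wandb.log({
            "turn": i,
            "progress_rate": t["progress"],        # fine-grained progress
            "grounding_ok": float(t["grounded"]),  # 1.0 if the action was executable
        })
    wandb.log({"success": float(success)})
    run.finish()


# Toy usage with fabricated numbers:
log_episode("alfworld", "gpt-4",
            turns=[{"progress": 0.0, "grounded": True},
                   {"progress": 0.5, "grounded": True},
                   {"progress": 1.0, "grounded": True}],
            success=True)
```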

Maintenance & Community

  • Accepted as a NeurIPS 2024 Oral and at the ICLR 2024 LLMAgents workshop.
  • Community support via a Slack workspace.

Licensing & Compatibility

  • Code License: Apache-2.0.
  • Data License: GNU General Public License, version 2.
  • The GPL-2.0 license for the dataset may impose restrictions on commercial use or derivative works if they incorporate the dataset.

Limitations & Caveats

  • WebArena task setup requires specific system dependencies (dbus, Xvfb) which may be challenging on some systems.
  • Proprietary model evaluation requires obtaining and managing API keys, adding an external dependency and potential cost.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star history: 22 stars in the last 90 days
