AgentBoard by hkust-nlp

Analytical evaluation board for multi-turn LLM agents

created 1 year ago
333 stars

Top 83.6% on sourcepulse

View on GitHub
Project Summary

AgentBoard provides a comprehensive analytical evaluation framework for multi-turn LLM agents across diverse environments. It targets researchers and developers aiming to systematically assess and compare the generalist capabilities of LLM agents, offering detailed insights into performance across various dimensions.

How It Works

AgentBoard employs four core principles: task diversity (9 tasks across Embodied AI, Game, Web, Tool), multi-round interaction, partially-observable environments, and analytical evaluation. It facilitates the construction of goal-oriented reflex agents and provides a Weights & Biases-integrated panel for visualizing fine-grained progress, grounding accuracy, and performance breakdowns. This approach enables a deeper understanding of agent behavior beyond simple success rates.
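
As a rough illustration of the analytical-evaluation idea, the sketch below shows a multi-turn loop that records a per-turn progress signal alongside the final success flag. The `env` and `agent` interfaces, the field names, and the `run_episode` helper are hypothetical stand-ins, not AgentBoard's actual API.

```python
# Minimal sketch, not AgentBoard's actual API: a multi-turn evaluation loop
# that records fine-grained progress at every step instead of only a final
# success flag. `env` and `agent` are hypothetical stand-ins for a
# partially-observable task environment and an LLM agent.
from dataclasses import dataclass, field
from typing import List


@dataclass
class TurnLog:
    turn: int
    action: str
    observation: str
    progress: float  # fraction of subgoals satisfied so far, in [0, 1]


@dataclass
class EpisodeResult:
    success: bool
    progress_rate: float                     # best progress reached in the episode
    trajectory: List[TurnLog] = field(default_factory=list)


def run_episode(env, agent, max_turns: int = 30) -> EpisodeResult:
    observation = env.reset()                # initial, partial observation
    trajectory, best_progress = [], 0.0
    for turn in range(max_turns):
        action = agent.act(observation)      # LLM proposes the next action
        observation, progress, done = env.step(action)
        best_progress = max(best_progress, progress)
        trajectory.append(TurnLog(turn, action, observation, progress))
        if done:
            break
    return EpisodeResult(success=best_progress >= 1.0,
                         progress_rate=best_progress,
                         trajectory=trajectory)
```

Tracking the best progress reached, rather than only the terminal state, is what lets the evaluation distinguish an agent that nearly completed a task from one that made no headway.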

Quick Start & Requirements

  • Installation: Local setup via setup.sh script (requires Python 3.8.13, conda). Docker image available.
  • Prerequisites: Internet access for certain tasks. For WebArena, dbus and Xvfb are required. API keys for proprietary models (OpenAI, Anthropic) and potentially tool-specific keys are needed. Weights & Biases API key for visualization.
  • Data: Download via wget https://huggingface.co/datasets/hkust-nlp/agentboard/resolve/main/data.tar.gz; a Python alternative is sketched after this list.
  • Resources: Local setup estimated at 15 minutes; Docker setup at 5 minutes (12GB). Evaluation runtime varies significantly by model and hardware (e.g., GPT-4: ~5.5h; DeepSeek-67b on 8xV100 with vLLM: ~18.5h).
  • Links: Website, Leaderboard, Paper, Data.
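
For reference, here is a Python alternative to the wget download above. It assumes the huggingface_hub package is installed and that the archive is published as data.tar.gz in the hkust-nlp/agentboard dataset repo, as in the URL shown; this is a sketch, not the project's official setup path.

```python
# A Python alternative to the wget command above. Assumes the huggingface_hub
# package is installed and that the archive is published as data.tar.gz in the
# hkust-nlp/agentboard dataset repo, as in the URL shown.
import tarfile
from huggingface_hub import hf_hub_download

archive_path = hf_hub_download(
    repo_id="hkust-nlp/agentboard",
    filename="data.tar.gz",
    repo_type="dataset",
)

# Unpack the archive into the current working directory.
with tarfile.open(archive_path, "r:gz") as tar:
    tar.extractall(path=".")
```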

Highlighted Details

  • Supports 12 state-of-the-art LLMs, including GPT-4, Claude2, Llama2, Mistral, and CodeLlama, with vLLM acceleration for open-source models.
  • Integrated Weights & Biases panel for detailed, multi-dimensional analysis and visualization of agent performance (see the logging sketch after this list).
  • Includes 9 diverse tasks: AlfWorld, ScienceWorld, BabyAI, Jericho, PDDL, WebShop, WebArena, Tool-Query, Tool-Operation.
  • Captures detailed trajectory logs, including screenshots and network traffic for WebArena.
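
Illustrative only: the snippet below shows how per-turn metrics such as progress and grounding could be streamed to a Weights & Biases run to produce breakdowns like those described above. The project, run, and metric names are hypothetical; this is not AgentBoard's own logging code.

```python
# Illustrative only, not AgentBoard's own logging code: streaming per-turn
# metrics to a Weights & Biases run so the panel can show fine-grained
# progress and grounding breakdowns. Project, run, and metric names are
# hypothetical.
import wandb


def log_episode(task, model, turns, success):
    """`turns` is a list of per-turn records such as
    {"progress": 0.5, "grounded": True}."""
    run = wandb.init(project="agentboard-eval", name=f"{model}-{task}")
    for i, t in enumerate(turns):
        wandb.log({
            "turn": i,
            "progress_rate": t["progress"],        # fine-grained progress
            "grounding_ok": float(t["grounded"]),  # 1.0 if the action was executable
        })
    wandb.log({"success": float(success)})
    run.finish()


# Toy usage with fabricated numbers:
log_episode("alfworld", "gpt-4",
            turns=[{"progress": 0.0, "grounded": True},
                   {"progress": 0.5, "grounded": True},
                   {"progress": 1.0, "grounded": True}],
            success=True)
```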

Maintenance & Community

  • Accepted as a NeurIPS 2024 Oral and at the ICLR 2024 LLMAgents workshop.
  • Community support via a Slack workspace.

Licensing & Compatibility

  • Code License: Apache-2.0.
  • Data License: GNU General Public License, version 2.
  • The GPL-2.0 license for the dataset may impose restrictions on commercial use or derivative works if they incorporate the dataset.

Limitations & Caveats

  • WebArena task setup requires specific system dependencies (dbus, Xvfb) which may be challenging on some systems.
  • Proprietary model evaluation requires obtaining and managing API keys, adding an external dependency and potential cost.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star history: 22 stars in the last 90 days
