VisualAgentBench by THUDM

Visual foundation agents benchmark for LMMs

Created 1 year ago
256 stars

Top 98.5% on SourcePulse

Project Summary

Summary

VisualAgentBench (VAB) addresses the need for systematic evaluation and development of Large Multimodal Models (LMMs) as visual foundation agents. It provides a comprehensive benchmark suite covering Embodied, GUI, and Visual Design tasks across five distinct environments. VAB enables researchers and practitioners to assess LMM capabilities in visually-grounded interactive scenarios and facilitates the development of more potent visual agents through its unique trajectory training dataset.

How It Works

VAB builds upon the AgentBench framework, employing an Agent-Controller, Task-Controller, and Assigner architecture for efficient, parallelized agent evaluation. Its core innovation lies in offering a trajectory training set specifically designed for behavior cloning (BC). This allows open Large Language Models (LLMs) and LMMs to be trained on agent task trajectories, enhancing their ability to follow complex instructions and perform visual tasks, a capability often lacking in base models.
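The Assigner / Task-Controller split described above can be pictured with a toy sketch. All class names here are illustrative, not VAB's actual API: the assigner hands queued task IDs to a pool of parallel workers backed by a task controller and collects the results.

```python
from concurrent.futures import ThreadPoolExecutor

class TaskController:
    """Toy stand-in for a VAB task server: owns the environments."""
    def run(self, task_id: int) -> str:
        return f"task-{task_id}: done"

class Assigner:
    """Toy assigner: dispatches queued tasks to parallel workers."""
    def __init__(self, controller: TaskController, workers: int = 4):
        self.controller = controller
        self.workers = workers

    def evaluate(self, task_ids):
        # map() preserves input order, so results line up with task_ids
        with ThreadPoolExecutor(max_workers=self.workers) as pool:
            return list(pool.map(self.controller.run, task_ids))

results = Assigner(TaskController()).evaluate(range(3))
```

The real framework (inherited from AgentBench) separates the agent side (Agent-Controller) from the environment side (Task-Controller) so that either can be swapped or scaled independently; this sketch only mirrors the dispatch-and-collect shape of that design.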

Quick Start & Requirements

Setup involves cloning the repository, creating and activating a Conda environment (python=3.9), and installing dependencies (pip install -r requirements.txt). Docker is a prerequisite. Users must configure their OpenAI API Key in configs/agents/openai-chat.yaml. To run tasks, first start the task server (python -m src.start_task -a), which typically takes about a minute to launch four workers. Subsequently, initiate the evaluation via the assigner (python -m src.assigner --auto-retry). Specific environments like VAB-WebArena-Lite may have additional setup instructions.
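The steps above correspond to roughly the following commands (the repository URL is inferred from the project name, and the Conda environment name is illustrative; consult the README for environment-specific setup such as VAB-WebArena-Lite):

```shell
# Clone and enter the repository (URL inferred from the project name)
git clone https://github.com/THUDM/VisualAgentBench.git
cd VisualAgentBench

# Create and activate the Conda environment (Python 3.9), install deps
conda create -n vab python=3.9 -y
conda activate vab
pip install -r requirements.txt

# Add your OpenAI API key to configs/agents/openai-chat.yaml, then:
python -m src.start_task -a          # start the task server (~1 min, 4 workers)
python -m src.assigner --auto-retry  # run the evaluation
```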

Highlighted Details

  • Features five environments: VAB-OmniGibson, VAB-Minecraft, VAB-Mobile (ongoing), VAB-WebArena-Lite, and VAB-CSS.
  • Covers three core visual agent task types: Embodied, GUI, and Visual Design.
  • Includes a trajectory dataset for behavior cloning, crucial for training open LMMs.
  • Maintains a leaderboard comparing proprietary and fine-tuned open LMM performance based on Success Rate (SR).
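Behavior cloning over the trajectory dataset amounts to supervised learning of action choices from recorded (observation, action) pairs. A dependency-free toy illustration of the idea (not VAB's actual training code, which fine-tunes LMMs rather than fitting a lookup table):

```python
from collections import Counter, defaultdict

def clone_policy(trajectories):
    """Fit a lookup policy: for each observation, pick the action the
    demonstrator chose most often (toy behavior cloning)."""
    counts = defaultdict(Counter)
    for traj in trajectories:
        for obs, action in traj:
            counts[obs][action] += 1
    return {obs: c.most_common(1)[0][0] for obs, c in counts.items()}

# Two recorded trajectories of (observation, action) steps
demos = [
    [("door_closed", "open"), ("door_open", "walk")],
    [("door_closed", "open"), ("door_open", "walk"), ("goal", "stop")],
]
policy = clone_policy(demos)
```

In VAB the same supervision signal (expert trajectories) is used to fine-tune open LMMs, which is what closes the instruction-following gap noted in the Limitations section.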

Maintenance & Community

No specific community channels (e.g., Discord, Slack) or maintenance indicators (e.g., sponsorships, active development signals) are detailed in the provided README snippet.

Licensing & Compatibility

Licensing information is not specified in the provided README content.

Limitations & Caveats

The VAB-Mobile environment is currently marked as "Ongoing." VAB-WebArena-Lite requires a separate installation and evaluation procedure. Open LMMs generally struggle with complex agent task instructions without prior finetuning on the VAB training dataset.

Health Check

  • Last Commit: 10 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 2 stars in the last 30 days
