VisualAgentBench by THUDM

Visual foundation agents benchmark for LMMs

Created 1 year ago
256 stars

Top 98.5% on SourcePulse

Project Summary

Summary

VisualAgentBench (VAB) addresses the need for systematic evaluation and development of Large Multimodal Models (LMMs) as visual foundation agents. It provides a comprehensive benchmark suite covering Embodied, GUI, and Visual Design tasks across five distinct environments. VAB enables researchers and practitioners to assess LMM capabilities in visually-grounded interactive scenarios and facilitates the development of more potent visual agents through its unique trajectory training dataset.

How It Works

VAB builds upon the AgentBench framework, employing an Agent-Controller, Task-Controller, and Assigner architecture for efficient, parallelized agent evaluation. Its core innovation lies in offering a trajectory training set specifically designed for behavior cloning (BC). This allows open Large Language Models (LLMs) and LMMs to be trained on agent task trajectories, enhancing their ability to follow complex instructions and perform visual tasks, a capability often lacking in base models.
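The Assigner / Task-Controller split described above can be pictured with a toy sketch. All class names here are illustrative, not VAB's actual API: the assigner hands queued task IDs to a pool of parallel workers backed by a task controller and collects the results.

```python
from concurrent.futures import ThreadPoolExecutor

class TaskController:
    """Toy stand-in for a VAB task server: owns the environments."""
    def run(self, task_id: int) -> str:
        return f"task-{task_id}: done"

class Assigner:
    """Toy assigner: dispatches queued tasks to parallel workers."""
    def __init__(self, controller: TaskController, workers: int = 4):
        self.controller = controller
        self.workers = workers

    def evaluate(self, task_ids):
        # map() preserves input order, so results line up with task_ids
        with ThreadPoolExecutor(max_workers=self.workers) as pool:
            return list(pool.map(self.controller.run, task_ids))

results = Assigner(TaskController()).evaluate(range(3))
```

The real framework (inherited from AgentBench) separates the agent side (Agent-Controller) from the environment side (Task-Controller) so that either can be swapped or scaled independently; this sketch only mirrors the dispatch-and-collect shape of that design.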

Quick Start & Requirements

Setup involves cloning the repository, creating and activating a Conda environment (python=3.9), and installing dependencies (pip install -r requirements.txt). Docker is a prerequisite. Users must configure their OpenAI API Key in configs/agents/openai-chat.yaml. To run tasks, first start the task server (python -m src.start_task -a), which typically takes about a minute to launch four workers. Subsequently, initiate the evaluation via the assigner (python -m src.assigner --auto-retry). Specific environments like VAB-WebArena-Lite may have additional setup instructions.
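The steps above correspond to roughly the following commands (the repository URL is inferred from the project name, and the Conda environment name is illustrative; consult the README for environment-specific setup such as VAB-WebArena-Lite):

```shell
# Clone and enter the repository (URL inferred from the project name)
git clone https://github.com/THUDM/VisualAgentBench.git
cd VisualAgentBench

# Create and activate the Conda environment (Python 3.9), install deps
conda create -n vab python=3.9 -y
conda activate vab
pip install -r requirements.txt

# Add your OpenAI API key to configs/agents/openai-chat.yaml, then:
python -m src.start_task -a          # start the task server (~1 min, 4 workers)
python -m src.assigner --auto-retry  # run the evaluation
```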

Highlighted Details

  • Features five environments: VAB-OmniGibson, VAB-Minecraft, VAB-Mobile (ongoing), VAB-WebArena-Lite, and VAB-CSS.
  • Covers three core visual agent task types: Embodied, GUI, and Visual Design.
  • Includes a trajectory dataset for behavior cloning, crucial for training open LMMs.
  • Maintains a leaderboard comparing proprietary and fine-tuned open LMM performance based on Success Rate (SR).
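Behavior cloning over the trajectory dataset amounts to supervised learning of action choices from recorded (observation, action) pairs. A dependency-free toy illustration of the idea (not VAB's actual training code, which fine-tunes LMMs rather than fitting a lookup table):

```python
from collections import Counter, defaultdict

def clone_policy(trajectories):
    """Fit a lookup policy: for each observation, pick the action the
    demonstrator chose most often (toy behavior cloning)."""
    counts = defaultdict(Counter)
    for traj in trajectories:
        for obs, action in traj:
            counts[obs][action] += 1
    return {obs: c.most_common(1)[0][0] for obs, c in counts.items()}

# Two recorded trajectories of (observation, action) steps
demos = [
    [("door_closed", "open"), ("door_open", "walk")],
    [("door_closed", "open"), ("door_open", "walk"), ("goal", "stop")],
]
policy = clone_policy(demos)
```

In VAB the same supervision signal (expert trajectories) is used to fine-tune open LMMs, which is what closes the instruction-following gap noted in the Limitations section.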

Maintenance & Community

No specific community channels (e.g., Discord, Slack) or maintenance indicators (e.g., sponsorships, active development signals) are detailed in the provided README snippet.

Licensing & Compatibility

Licensing information is not specified in the provided README content.

Limitations & Caveats

The VAB-Mobile environment is currently marked as "Ongoing." VAB-WebArena-Lite requires a separate installation and evaluation procedure. Open LMMs generally struggle with complex agent task instructions without prior finetuning on the VAB training dataset.

Health Check

  • Last Commit: 10 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 2 stars in the last 30 days
