visualwebarena  by web-arena-x

Benchmark for multimodal agent evaluation on visual web tasks

created 1 year ago
364 stars

Top 78.4% on sourcepulse

Project Summary

VisualWebArena provides a benchmark for evaluating multimodal autonomous language agents on realistic, complex visual web tasks. It extends the reproducible, execution-based evaluation framework of WebArena, targeting researchers and developers building and testing AI agents that interact with the visual web.

How It Works

VisualWebArena uses execution-based evaluation: agents act inside web environments and are judged on the outcome of those interactions. It supports several observation types, including accessibility trees and image-based observations augmented with Set-of-Marks (SoM) annotations for visual grounding. This makes it possible to evaluate how well an agent understands and manipulates visual web content.
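To make the SoM-style observation more concrete, the sketch below numbers each interactable element on a page and renders a text listing that could accompany an annotated screenshot. It is a minimal illustration under stated assumptions, not the repository's implementation: the `Element` class, the `assign_marks` and `marks_to_text` helpers, the reading-order sort, and the output format are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Element:
    """Hypothetical interactable page element (not a type from the repository)."""
    tag: str
    text: str
    bbox: tuple[int, int, int, int]  # (x, y, width, height) in pixels


def assign_marks(elements: list[Element]) -> dict[int, Element]:
    """Give each element a unique numeric ID, ordered top-to-bottom, then left-to-right."""
    ordered = sorted(elements, key=lambda e: (e.bbox[1], e.bbox[0]))
    return {i: e for i, e in enumerate(ordered, start=1)}


def marks_to_text(marks: dict[int, Element]) -> str:
    """Render the text half of the observation: one line per marked element."""
    return "\n".join(f"[{i}] [{e.tag.upper()}] [{e.text}]" for i, e in marks.items())


# Illustrative page content (assumed, not taken from the benchmark).
elements = [
    Element("button", "Add to cart", (400, 300, 120, 40)),
    Element("a", "Home", (10, 10, 60, 20)),
    Element("img", "red bicycle", (50, 120, 300, 200)),
]
marks = assign_marks(elements)
print(marks_to_text(marks))
```

With an observation like this, an agent can refer to an element by its numeric ID (for example, "click [3]") instead of by pixel coordinates.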

Quick Start & Requirements

  • Install: Clone the repository, create and activate a Python 3.10 or 3.11 virtual environment, install the requirements (pip install -r requirements.txt), install the Playwright browsers (playwright install), then install the package itself (pip install -e .).
  • Prerequisites: Python 3.10 or 3.11 (3.12 is not supported), Playwright, and API keys for the models you use (OpenAI, or Google Cloud for Gemini). A GPU with approximately 12GB of VRAM is recommended for captioning models.
  • Setup: Requires configuring environment variables for website domains and reset tokens. An Amazon Machine Image is available for pre-installed environments.
  • Docs: Website
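Before launching an agent, a small preflight script can catch the two most common setup problems listed above: an unsupported Python version and unset environment variables. The following is a minimal sketch, not part of the repository. The variable names in `REQUIRED` are placeholders chosen for illustration; the actual names must be taken from the project's README.

```python
import os
import sys


def python_version_ok(version: tuple[int, int]) -> bool:
    """Return True only for the supported interpreters (3.10 and 3.11)."""
    return version in {(3, 10), (3, 11)}


def missing_env_vars(required, environ=None):
    """Return the names in `required` that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in required if not env.get(name)]


# Placeholder names (assumption): replace with the variables the README lists
# for website domains and reset tokens.
REQUIRED = ["EXAMPLE_SITE_URL", "EXAMPLE_SITE_RESET_TOKEN"]

if not python_version_ok(sys.version_info[:2]):
    print("Unsupported Python version; use 3.10 or 3.11.")

for name in missing_env_vars(REQUIRED):
    print(f"Environment variable not set: {name}")
```

Passing the environment as an argument keeps the check easy to test without touching the real process environment.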

Highlighted Details

  • Supports multiple LLM providers (OpenAI, Gemini) and observation types (accessibility tree, image + SoM).
  • Includes agent trajectories for GPT-4V + SoM on 910 VWA tasks.
  • Offers a demo script to run agents on arbitrary webpages.
  • Provides human evaluation trajectories on 233 tasks.

Maintenance & Community

The project is based on the WebArena codebase. Links to community channels are not explicitly provided in the README.

Licensing & Compatibility

The README does not explicitly state the repository's license. Because the code is built on WebArena, users should verify the licensing terms before commercial use or integration into closed-source projects.

Limitations & Caveats

The project runs only on Python 3.10 or 3.11 because it depends on deprecated modules, which rules out Python 3.12. Setting up and configuring the web environments takes significant effort and requires familiarity with environment variables.

Health Check

  • Last commit: 8 months ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 32 stars in the last 90 days
