visualwebarena by web-arena-x

Benchmark for multimodal agent evaluation on visual web tasks

Created 2 years ago

436 stars

Top 68.4% on SourcePulse

2 Experts Love This Project

Travis Fischer

Founder of Agentic

Jeff Hammerbacher

Cofounder of Cloudera

Project Summary

VisualWebArena provides a benchmark for evaluating multimodal autonomous language agents on realistic, complex visual web tasks. It extends the reproducible, execution-based evaluation framework of WebArena, targeting researchers and developers building and testing AI agents that interact with the visual web.

How It Works

VisualWebArena uses execution-based evaluation: agents run inside web environments and are scored on the results of their actions. It supports several observation types, including accessibility trees and screenshots annotated with Set-of-Marks (SoM), where interactable page elements are labeled with IDs to give the agent explicit visual grounding. This allows precise evaluation of how well an agent understands and manipulates visual web content.
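
As a rough illustration of the SoM idea described above, the sketch below assigns a numeric mark to each interactable element and builds a text listing that could accompany an annotated screenshot. The `Element` class, the `[id] [TAG] [text]` listing format, and both helper functions are hypothetical and written for this example only; they are not taken from the VisualWebArena codebase.

```python
from dataclasses import dataclass


@dataclass
class Element:
    """Hypothetical interactable page element (not a VisualWebArena type)."""
    tag: str
    text: str
    box: tuple[int, int, int, int]  # (left, top, width, height) in pixels


def build_som_observation(elements):
    """Assign a numeric mark to each element.

    Returns a text listing plus a mapping from mark ID to bounding box,
    so an action such as "click [2]" can be resolved to page coordinates.
    """
    lines = []
    marks = {}
    for mark_id, el in enumerate(elements, start=1):
        marks[mark_id] = el.box
        lines.append(f"[{mark_id}] [{el.tag.upper()}] [{el.text}]")
    return "\n".join(lines), marks


def click_point(marks, mark_id):
    """Resolve a mark ID to the center of its bounding box."""
    left, top, width, height = marks[mark_id]
    return (left + width // 2, top + height // 2)


elements = [
    Element("a", "Home", (10, 10, 80, 20)),
    Element("button", "Add to cart", (200, 340, 120, 40)),
]
listing, marks = build_som_observation(elements)
print(listing)
print(click_point(marks, 2))
```

The point of the design is that the agent refers to elements by mark ID rather than by raw pixel coordinates, and the environment maps the ID back to a location.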

Quick Start & Requirements

Install: Clone the repository, create and activate a Python 3.10/3.11 virtual environment, install requirements (pip install -r requirements.txt), install Playwright (playwright install), and install the package (pip install -e .).
Prerequisites: Python 3.10 or 3.11 (3.12 is not supported), Playwright, and API keys for the models used (OpenAI, or Google Cloud for Gemini). A GPU with roughly 12 GB of VRAM is recommended for the captioning models.
Setup: Requires configuring environment variables for website domains and reset tokens. An Amazon Machine Image is available for pre-installed environments.
Docs: Website
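
The prerequisites above can be checked before a run with a small preflight script. This is a minimal sketch, not part of the project: the helper functions are hypothetical, and the environment variable names in `REQUIRED_ENV_VARS` are assumptions for illustration that should be confirmed against the repository README.

```python
import os
import sys

# Assumed variable names (website domains and a reset token);
# confirm the exact names against the repository README.
REQUIRED_ENV_VARS = [
    "DATASET",
    "CLASSIFIEDS",
    "CLASSIFIEDS_RESET_TOKEN",
    "SHOPPING",
    "REDDIT",
    "WIKIPEDIA",
    "HOMEPAGE",
]


def python_version_ok(version_info=sys.version_info):
    """Return True only for Python 3.10 or 3.11."""
    return (version_info[0], version_info[1]) in {(3, 10), (3, 11)}


def missing_env_vars(names, environ=None):
    """Return the names that are unset or empty in the environment."""
    environ = os.environ if environ is None else environ
    return [name for name in names if not environ.get(name)]


def preflight():
    """Collect human-readable problems; an empty list means ready to run."""
    problems = []
    if not python_version_ok():
        found = f"{sys.version_info[0]}.{sys.version_info[1]}"
        problems.append(f"Python 3.10 or 3.11 is required, found {found}")
    for name in missing_env_vars(REQUIRED_ENV_VARS):
        problems.append(f"environment variable {name} is not set")
    return problems


if __name__ == "__main__":
    for problem in preflight():
        print("preflight:", problem)
```

Running the script in the activated virtual environment prints one line per problem and nothing when the checks pass.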

Highlighted Details

Supports multiple LLM providers (OpenAI, Gemini) and observation types (accessibility tree, image + SoM).
Includes agent trajectories for GPT-4V + SoM on 910 VWA tasks.
Offers a demo script to run agents on arbitrary webpages.
Provides human evaluation trajectories on 233 tasks.

Maintenance & Community

The project is based on the WebArena codebase. Links to community channels are not explicitly provided in the README.

Licensing & Compatibility

The repository's license is not explicitly stated in the README. Because the project builds on the WebArena codebase, users should verify the licensing terms of both projects before commercial use or integration into closed-source products.

Limitations & Caveats

The project requires Python 3.10 or 3.11 because it depends on deprecated modules; Python 3.12 is not supported. Standing up the web environments takes significant effort, including configuring environment variables for each website's domain and reset token.

Health Check

Last Commit

1 year ago

Responsiveness

Inactive

Star History

9 stars in the last 30 days