visualwebarena  by web-arena-x

Benchmark for multimodal agent evaluation on visual web tasks

Created 1 year ago
378 stars

Top 75.2% on SourcePulse

Project Summary

VisualWebArena provides a benchmark for evaluating multimodal autonomous language agents on realistic, complex visual web tasks. It extends the reproducible, execution-based evaluation framework of WebArena, targeting researchers and developers building and testing AI agents that interact with the visual web.

How It Works

VisualWebArena uses execution-based evaluation: agents act inside web environments and are scored on the outcome of their actions. It supports several observation types, including accessibility trees and screenshots annotated with Set-of-Marks (SoM), which overlays labeled bounding boxes on page elements for visual grounding. This makes it possible to evaluate how well an agent understands and manipulates visual web content.
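To make the image + SoM observation concrete, the sketch below shows one way numbered marks could be assigned to page elements and mapped back to click coordinates. It is a minimal illustration, not the project's implementation: the `Element` type, the function names, the reading-order sort, and the legend format are all assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    """An interactable page element with a pixel bounding box (hypothetical type)."""
    tag: str
    text: str
    x: int
    y: int
    width: int
    height: int


def assign_marks(elements):
    """Give each element a numeric mark, ordered top-to-bottom, then left-to-right."""
    ordered = sorted(elements, key=lambda e: (e.y, e.x))
    return {mark: el for mark, el in enumerate(ordered, start=1)}


def render_legend(marks):
    """Build a text listing that could accompany the annotated screenshot."""
    return "\n".join(
        f"[{mark}] [{el.tag}] [{el.text}]" for mark, el in marks.items()
    )


def resolve_action(marks, mark_id):
    """Map an action that names a mark back to the centre of that element."""
    el = marks[mark_id]
    return (el.x + el.width // 2, el.y + el.height // 2)


# Illustrative elements only; a real page would supply these from the browser.
elements = [
    Element("BUTTON", "Add to cart", x=400, y=300, width=120, height=40),
    Element("A", "Home", x=10, y=10, width=60, height=20),
    Element("IMG", "red bicycle", x=50, y=300, width=200, height=150),
]
marks = assign_marks(elements)
legend = render_legend(marks)
print(legend)
print(resolve_action(marks, 3))
```

The agent sees the marked screenshot plus the legend and refers to elements by mark number, so the harness only needs the lookup in `resolve_action` to turn a chosen mark into a concrete interaction point.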

Quick Start & Requirements

  • Install: Clone the repository, create and activate a Python 3.10/3.11 virtual environment, install requirements (pip install -r requirements.txt), install Playwright (playwright install), and install the package (pip install -e .).
  • Prerequisites: Python 3.10 or 3.11 (not 3.12), Playwright, API keys for models (OpenAI or Google Cloud for Gemini). GPU is recommended for captioning models (approx. 12GB VRAM).
  • Setup: Requires configuring environment variables for website domains and reset tokens. An Amazon Machine Image is available for pre-installed environments.
  • Docs: Website
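Before running the install steps above, it can help to confirm the interpreter version and the required environment variables. The following is a minimal preflight sketch under stated assumptions: the variable names in `REQUIRED_ENV_VARS` are placeholders chosen for illustration, and the README is the authority on the actual names.

```python
import os
import sys

# Python versions the summary lists as supported.
SUPPORTED_PYTHON = {(3, 10), (3, 11)}

# Placeholder names (assumptions): replace with the variables the README specifies
# for website domains, reset tokens, and model API keys.
REQUIRED_ENV_VARS = [
    "SHOPPING",
    "REDDIT",
    "CLASSIFIEDS",
    "CLASSIFIEDS_RESET_TOKEN",
    "OPENAI_API_KEY",
]


def check_python(version_info=sys.version_info):
    """Return True if the major.minor version is one of the supported ones."""
    return tuple(version_info[:2]) in SUPPORTED_PYTHON


def missing_env_vars(required, environ=None):
    """Return the names in `required` that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [name for name in required if not environ.get(name)]


def preflight(required=REQUIRED_ENV_VARS, environ=None, version_info=sys.version_info):
    """Collect human-readable problems; an empty list means the checks passed."""
    problems = []
    if not check_python(version_info):
        found = ".".join(str(part) for part in version_info[:2])
        problems.append(f"Python {found} is not supported; use 3.10 or 3.11")
    for name in missing_env_vars(required, environ):
        problems.append(f"environment variable {name} is not set")
    return problems


if __name__ == "__main__":
    for problem in preflight():
        print("WARNING:", problem)
```

Passing `environ` and `version_info` explicitly keeps the checks easy to test without touching the real process environment.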

Highlighted Details

  • Supports multiple LLM providers (OpenAI, Gemini) and observation types (accessibility tree, image + SoM).
  • Includes agent trajectories for GPT-4V + SoM on 910 VWA tasks.
  • Offers a demo script to run agents on arbitrary webpages.
  • Provides human evaluation trajectories on 233 tasks.

Maintenance & Community

The project is based on the WebArena codebase. Links to community channels are not explicitly provided in the README.

Licensing & Compatibility

The README does not explicitly state the repository's license. Because the project builds on the WebArena codebase, users should verify the licensing terms of both projects before commercial use or integration into closed-source projects.

Limitations & Caveats

The project supports only Python 3.10 and 3.11 because it depends on deprecated modules. Setting up and configuring the web environments takes significant effort and requires familiarity with environment variables.

Health Check

  • Last commit: 10 months ago
  • Responsiveness: Inactive
  • Pull requests (30d): 2
  • Issues (30d): 1
  • Star history: 12 stars in the last 30 days
