Benchmark for multimodal agent evaluation on visual web tasks
VisualWebArena provides a benchmark for evaluating multimodal autonomous language agents on realistic, complex visual web tasks. It extends the reproducible, execution-based evaluation framework of WebArena, targeting researchers and developers building and testing AI agents that interact with the visual web.
How It Works
VisualWebArena uses an execution-based evaluation approach, simulating agent interactions within web environments. It supports several observation types, including accessibility trees and image-based observations augmented with Set-of-Marks (SoM) annotations for visual grounding. This allows precise evaluation of how well an agent understands and manipulates visual web content.
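To make the SoM idea concrete, here is a minimal sketch of how marks might be assigned to page elements. It is not the project's implementation: the `Element` class, the `assign_marks` and `marks_to_text` helpers, and the `[id] [tag] [text]` line format are all assumptions made for illustration. It covers only the text half of such an observation; drawing the numbered boxes onto the screenshot is left out.

```python
from dataclasses import dataclass


@dataclass
class Element:
    """A hypothetical interactable page element with a pixel bounding box."""
    tag: str
    text: str
    x: int
    y: int
    width: int
    height: int


def assign_marks(elements, viewport_width, viewport_height):
    """Give each element that overlaps the viewport a numeric ID.

    IDs follow reading order: top-to-bottom, then left-to-right.
    """
    visible = [
        e for e in elements
        if e.width > 0 and e.height > 0
        and e.x < viewport_width and e.y < viewport_height
        and e.x + e.width > 0 and e.y + e.height > 0
    ]
    visible.sort(key=lambda e: (e.y, e.x))
    return {i: e for i, e in enumerate(visible, start=1)}


def marks_to_text(marks):
    """Render one line per mark so the agent can refer to elements by ID."""
    return "\n".join(f"[{i}] [{e.tag}] [{e.text}]" for i, e in marks.items())


# Illustrative data only; a real page would supply these from the browser.
elements = [
    Element("A", "Next page", 600, 900, 80, 20),
    Element("IMG", "red bicycle", 40, 120, 200, 150),
    Element("BUTTON", "Add to cart", 300, 120, 100, 30),
    Element("INPUT", "", 0, 2000, 100, 20),  # below the viewport, so skipped
]
marks = assign_marks(elements, viewport_width=1280, viewport_height=1024)
print(marks_to_text(marks))
```

With marks like these, an agent can issue an action such as "click element 2" instead of predicting raw pixel coordinates, which is the grounding benefit SoM-style observations aim for.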
Quick Start & Requirements
Install the dependencies (pip install -r requirements.txt), install Playwright (playwright install), and install the package (pip install -e .).
Highlighted Details
Maintenance & Community
The project is based on the WebArena codebase. Links to community channels are not explicitly provided in the README.
Licensing & Compatibility
The repository's license is not explicitly stated in the README. However, given its foundation on WebArena, users should verify licensing terms for commercial use or integration into closed-source projects.
Limitations & Caveats
The project requires Python 3.10 or 3.11 because it depends on deprecated modules. Setting up and configuring the various web environments takes significant effort and an understanding of the environment variables involved.
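A small preflight check can surface both caveats before a long evaluation run. The sketch below is an assumption-laden illustration, not part of the project: `python_supported` and `missing_env_vars` are hypothetical helpers, and the variable names in `REQUIRED` are placeholders to be replaced with the names the README actually specifies.

```python
import os
import sys

# Versions named in the caveat above.
SUPPORTED_PYTHON = {(3, 10), (3, 11)}


def python_supported(version_info=sys.version_info):
    """Return True if the interpreter's major.minor version is supported."""
    return tuple(version_info[:2]) in SUPPORTED_PYTHON


def missing_env_vars(required, environ=None):
    """Return the required variable names that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [name for name in required if not environ.get(name)]


# Placeholder names (hypothetical); substitute the ones from the README.
REQUIRED = ["EXAMPLE_SHOPPING_URL", "EXAMPLE_REDDIT_URL"]

if not python_supported():
    print(f"Unsupported Python {sys.version_info.major}.{sys.version_info.minor}")

missing = missing_env_vars(REQUIRED)
if missing:
    print("Unset environment variables:", ", ".join(missing))
```

Passing `environ` explicitly keeps the helper easy to test without touching the real process environment.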