Web agent for end-to-end website interaction using multimodal models
WebVoyager is an end-to-end web agent designed to complete user instructions on real-world websites by integrating textual and visual information. It targets researchers and developers looking to build sophisticated web automation tools powered by Large Multimodal Models (LMMs), offering a generalist planning approach for navigation and an automated evaluation protocol.
How It Works
WebVoyager leverages LMMs, specifically GPT-4V, to process both visual (screenshots) and textual (accessibility trees) information from web pages. It employs a planning approach to navigate websites and interact with elements, aiming to complete tasks autonomously. The system uses Selenium for web browsing and includes a mechanism for extracting interactive elements with bounding boxes for precise interaction.
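The element-extraction step described above can be illustrated with a minimal sketch. It is not WebVoyager's actual code: the names `Box`, `label_elements`, and `click_point` are hypothetical, and both the visibility filter and the numeric labelling scheme are assumptions made for illustration. The input dictionaries use the same keys (`x`, `y`, `width`, `height`) that Selenium's `WebElement.rect` reports, so in a real run they could plausibly come from the browser rather than being hard-coded.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Bounding box in viewport pixels (hypothetical helper type)."""
    x: float
    y: float
    width: float
    height: float


def label_elements(rects, viewport_width, viewport_height):
    """Give a numeric label to every element that could plausibly be clicked.

    `rects` is a list of dicts with the keys x, y, width, height.
    Returns {label: Box}, with labels starting at 0 in input order.
    The filtering rules are assumptions, not taken from the project.
    """
    labelled = {}
    for rect in rects:
        box = Box(rect["x"], rect["y"], rect["width"], rect["height"])
        # Skip zero-area boxes, which usually mean the element is hidden.
        if box.width <= 0 or box.height <= 0:
            continue
        # Skip boxes that start beyond the right or bottom viewport edge.
        if box.x >= viewport_width or box.y >= viewport_height:
            continue
        # Skip boxes that end before the left or top viewport edge.
        if box.x + box.width <= 0 or box.y + box.height <= 0:
            continue
        labelled[len(labelled)] = box
    return labelled


def click_point(box):
    """Return the centre of a box, a common target for a click action."""
    return (box.x + box.width / 2, box.y + box.height / 2)


if __name__ == "__main__":
    sample = [
        {"x": 10, "y": 20, "width": 100, "height": 40},   # visible button
        {"x": 0, "y": 0, "width": 0, "height": 0},        # hidden element
        {"x": 2000, "y": 50, "width": 80, "height": 30},  # off-screen
    ]
    for label, box in label_elements(sample, 1024, 768).items():
        print(label, click_point(box))
```

In a setup like this, the labels would be drawn onto the screenshot and the model would refer to an element by its number, which the agent then maps back to a click position.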
Quick Start & Requirements
Setup: create a conda environment (conda create -n webvoyager python=3.10), activate it (conda activate webvoyager), and install dependencies (pip install -r requirements.txt).
Data: data/WebVoyager_data.jsonl and data/GAIA_web.jsonl.
Run: bash run.sh, ensuring your OpenAI API key is set.
Highlighted Details
Maintenance & Community
The project is associated with authors from Tencent. The README does not specify community channels or a roadmap.
Licensing & Compatibility
Limitations & Caveats
Tasks involving time-sensitive information (e.g., Booking, Google Flights) require manual date updates. The project disclaimer notes that results can be influenced by API non-determinism, prompt changes, and website updates, and does not guarantee accuracy or assume legal responsibility for outputs.