WebVoyager by MinorJerry

Web agent for end-to-end website interaction using multimodal models

Created 1 year ago
914 stars

Top 39.9% on SourcePulse

Project Summary

WebVoyager is an end-to-end web agent designed to complete user instructions on real-world websites by integrating textual and visual information. It targets researchers and developers looking to build sophisticated web automation tools powered by Large Multimodal Models (LMMs), offering a generalist planning approach for navigation and an automated evaluation protocol.

How It Works

WebVoyager leverages LMMs, specifically GPT-4V, to process both visual (screenshots) and textual (accessibility trees) information from web pages. It employs a planning approach to navigate websites and interact with elements, aiming to complete tasks autonomously. The system uses Selenium for web browsing and includes a mechanism for extracting interactive elements with bounding boxes for precise interaction.
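The element-extraction step can be pictured with a minimal sketch. This is not the project's actual code: the `LabeledElement` class and `label_visible_elements` function are hypothetical names, and the filtering rule (keep elements with non-zero area that overlap the viewport) is an assumption made for illustration. The input dicts use the same `x`, `y`, `width`, `height` keys that Selenium's `WebElement.rect` returns.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class LabeledElement:
    """Hypothetical record: a numeric label plus the element's bounding box."""
    label: int      # number that would be drawn on the screenshot
    x: float
    y: float
    width: float
    height: float

    @property
    def center(self) -> Tuple[float, float]:
        """Point a click would target: the middle of the bounding box."""
        return (self.x + self.width / 2, self.y + self.height / 2)


def label_visible_elements(
    rects: List[Dict[str, float]],
    viewport_width: float,
    viewport_height: float,
) -> List[LabeledElement]:
    """Assign consecutive numeric labels to boxes that overlap the viewport.

    Assumption: boxes with zero area or entirely off-screen are skipped.
    """
    labeled = []
    for rect in rects:
        has_area = rect["width"] > 0 and rect["height"] > 0
        overlaps_x = rect["x"] < viewport_width and rect["x"] + rect["width"] > 0
        overlaps_y = rect["y"] < viewport_height and rect["y"] + rect["height"] > 0
        if has_area and overlaps_x and overlaps_y:
            labeled.append(
                LabeledElement(
                    label=len(labeled),
                    x=rect["x"],
                    y=rect["y"],
                    width=rect["width"],
                    height=rect["height"],
                )
            )
    return labeled
```

With labels like these, a model reply that names a label number can be mapped back to a concrete click position through `center`.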

Quick Start & Requirements

  • Install: Create a conda environment (conda create -n webvoyager python=3.10), activate it (conda activate webvoyager), and install dependencies (pip install -r requirements.txt).
  • Prerequisites: Chrome browser (ChromeDriver is automatically handled by recent Selenium versions), Python 3.10.13. For Linux servers, Chromium installation is recommended. An OpenAI API key is required.
  • Data: The project includes datasets in data/WebVoyager_data.jsonl and data/GAIA_web.jsonl.
  • Running: Execute tasks using bash run.sh, ensuring your OpenAI API key is set.
  • Docs: Paper
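Before running anything, it can help to inspect the bundled task files. The sketch below reads a JSONL file (one JSON object per line) using only the standard library. The `load_tasks` helper is a hypothetical name, and nothing here assumes particular field names inside each task record.

```python
import json
from pathlib import Path


def load_tasks(path):
    """Read one JSON object per non-empty line of a JSONL file."""
    tasks = []
    with Path(path).open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            try:
                tasks.append(json.loads(line))
            except json.JSONDecodeError as err:
                # Report the offending line so a broken record is easy to find.
                raise ValueError(f"{path}:{line_number}: invalid JSON") from err
    return tasks
```

Calling `load_tasks("data/WebVoyager_data.jsonl")` from the repository root should return a list of task dicts, which can then be counted or filtered before launching run.sh.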

Highlighted Details

  • Integrates textual and visual information for end-to-end web task completion.
  • Features a generalist planning approach for web navigation.
  • Includes an automated evaluation protocol using GPT-4V.
  • Provides a dataset of 643 task queries across 15 websites and extracts tasks from the GAIA dataset.
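To make the automated evaluation idea concrete, here is a minimal sketch of turning free-text judge replies into a task success rate. It assumes the judging model ends each reply with a verdict of either "SUCCESS" or "NOT SUCCESS"; those strings, and the `parse_verdict` and `success_rate` helpers, are illustrative assumptions rather than the project's documented interface.

```python
import re
from typing import Iterable, Optional

# Assumed verdict strings; check "NOT SUCCESS" first because it contains "SUCCESS".
_NOT_SUCCESS = re.compile(r"\bNOT SUCCESS\b")
_SUCCESS = re.compile(r"\bSUCCESS\b")


def parse_verdict(judge_reply: str) -> Optional[bool]:
    """Return True for success, False for failure, None if no verdict is found."""
    text = judge_reply.upper()
    if _NOT_SUCCESS.search(text):
        return False
    if _SUCCESS.search(text):
        return True
    return None


def success_rate(judge_replies: Iterable[str]) -> float:
    """Fraction of successful tasks among replies that contain a verdict."""
    verdicts = [parse_verdict(reply) for reply in judge_replies]
    decided = [v for v in verdicts if v is not None]
    if not decided:
        return 0.0
    return sum(decided) / len(decided)
```

Replies without a recognizable verdict are left out of the denominator here; counting them as failures instead is an equally reasonable choice and only changes the last function.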

Maintenance & Community

The project is associated with authors from Tencent. The README does not specify community channels or a roadmap.

Licensing & Compatibility

  • License: Apache 2.0.
  • Compatibility: Permissive for commercial use and integration with closed-source projects.

Limitations & Caveats

Tasks involving time-sensitive information (e.g., Booking, Google Flights) require manual date updates. The project disclaimer notes that results can be influenced by API non-determinism, prompt changes, and website updates, and does not guarantee accuracy or assume legal responsibility for outputs.
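One way to find tasks that need a manual date update is to scan each query for dates that have already passed. The sketch below assumes dates are written in the form "January 10, 2024"; that format, and the `find_past_dates` helper, are assumptions for illustration, so other date styles in the task files would need additional patterns.

```python
import re
from datetime import date
from typing import List

_MONTH_NAMES = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]
_MONTH_NUMBER = {name: index for index, name in enumerate(_MONTH_NAMES, start=1)}

# Matches dates such as "January 10, 2024" (assumed format).
_DATE_PATTERN = re.compile(
    r"\b(" + "|".join(_MONTH_NAMES) + r") (\d{1,2}), (\d{4})\b"
)


def find_past_dates(text: str, today: date) -> List[date]:
    """Return every date mentioned in `text` that falls before `today`."""
    past = []
    for match in _DATE_PATTERN.finditer(text):
        month = _MONTH_NUMBER[match.group(1)]
        day = int(match.group(2))
        year = int(match.group(3))
        try:
            found = date(year, month, day)
        except ValueError:
            continue  # skip impossible dates such as "February 31, 2024"
        if found < today:
            past.append(found)
    return past
```

Running this over each task query with `date.today()` gives a list of stale dates to edit by hand; the sketch only reports them and does not rewrite the task text.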

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 2
  • Star history: 28 stars in the last 30 days

Explore Similar Projects

Starred by Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), Gregor Zunic (Cofounder of Browser Use), and 1 more.

BrowserGym by ServiceNow

0.8%
895
Gym environment for web task automation research
Created 1 year ago
Updated 1 day ago