molmoweb by allenai

Autonomous multimodal web agent SDK

Created 2 weeks ago

475 stars

Top 64.1% on SourcePulse

Project Summary

MolmoWeb is an open multimodal web agent for autonomous web navigation and task completion. Researchers and developers can use it to automate browser interactions such as clicking, typing, and scrolling, driven by natural language prompts. The project provides the agent code, an inference client, and evaluation benchmarks, with the aim of making results on automated web tasks reproducible.

How It Works

MolmoWeb uses a multimodal large language model that interprets a natural language task together with visual and structural information from the current web page (screenshots and accessibility trees). It then generates a sequence of browser actions (clicking elements, typing text, scrolling, and navigating to URLs) to fulfill the user's objective. This allows multi-step tasks to be executed directly within a web browser environment.
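
The observe, predict, and act cycle described above can be sketched roughly as follows. This is a minimal illustration, not MolmoWeb's actual code: the names Observation, Action, FakeBrowser, run_agent, and scripted_policy are all hypothetical, the browser is a stand-in that only records actions, and the "model" is a fixed script rather than a multimodal LLM.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Observation:
    """What the agent sees at each step (hypothetical structure)."""
    url: str
    screenshot: bytes  # raw image bytes of the current viewport
    ax_tree: str       # text rendering of the accessibility tree


@dataclass
class Action:
    """One browser action (hypothetical structure)."""
    kind: str  # "click", "type", "scroll", "goto", or "done"
    args: dict = field(default_factory=dict)


class FakeBrowser:
    """Stand-in for a real browser: it only records what it is asked to do."""

    def __init__(self, url: str = "about:blank") -> None:
        self.url = url
        self.log: List[str] = []

    def observe(self) -> Observation:
        return Observation(url=self.url, screenshot=b"", ax_tree="")

    def execute(self, action: Action) -> None:
        if action.kind == "goto":
            self.url = action.args["url"]
        self.log.append(action.kind)


Policy = Callable[[str, Observation, List[Action]], Action]


def run_agent(task: str, browser: FakeBrowser, policy: Policy,
              max_steps: int = 10) -> List[Action]:
    """Loop: observe the page, ask the policy for an action, execute it."""
    history: List[Action] = []
    for _ in range(max_steps):
        obs = browser.observe()
        action = policy(task, obs, history)
        history.append(action)
        if action.kind == "done":  # the policy signals task completion
            break
        browser.execute(action)
    return history


def scripted_policy(task: str, obs: Observation,
                    history: List[Action]) -> Action:
    """Placeholder for the model: replays a fixed action script."""
    script = [
        Action("goto", {"url": "https://example.com"}),
        Action("click", {"x": 10, "y": 20}),
        Action("done"),
    ]
    return script[len(history)]


browser = FakeBrowser()
history = run_agent("open example.com and click", browser, scripted_policy)
print([a.kind for a in history])
```

In a real agent, the policy would send the task, screenshot, and accessibility tree to the model and parse its response into an action; the max_steps cap guards against a policy that never emits "done".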

Quick Start & Requirements

  • Installation: Requires Python 3.10+. Uses uv for dependency management. Clone the repository, create a virtual environment with uv venv, and sync dependencies with uv sync. Playwright browsers must be installed via uv run playwright install --with-deps chromium.
  • Prerequisites: Environment variables for API keys are required for specific backends: BROWSERBASE_API_KEY and BROWSERBASE_PROJECT_ID for Browserbase; GOOGLE_API_KEY for Google Gemini; OPENAI_API_KEY for GPT-based judges.
  • Model Download & Server: Use scripts/download_weights.sh to fetch models (e.g., allenai/MolmoWeb-8B) and scripts/start_server.sh to launch a local inference server.
  • Testing: Execute scripts/test_server.py to test the running model server.
  • Documentation: Links to Paper, Blog Post, Demo, Models, and Data are provided.
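
Because each backend needs different environment variables, a small pre-flight check can catch a missing key before a run starts. The sketch below is only an illustration: the variable names come from the prerequisites above, but the backend labels ("browserbase", "gemini", "gpt_judge") and the helper missing_env_vars are hypothetical and not part of the project.

```python
import os
from typing import Dict, List, Mapping, Optional

# Variable names are taken from the prerequisites list; the backend
# labels used as keys are hypothetical and chosen for illustration.
REQUIRED_ENV: Dict[str, List[str]] = {
    "browserbase": ["BROWSERBASE_API_KEY", "BROWSERBASE_PROJECT_ID"],
    "gemini": ["GOOGLE_API_KEY"],
    "gpt_judge": ["OPENAI_API_KEY"],
}


def missing_env_vars(backend: str,
                     env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the required variables that are unset or empty for a backend."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_ENV[backend] if not env.get(name)]


if __name__ == "__main__":
    for backend in REQUIRED_ENV:
        missing = missing_env_vars(backend)
        status = "ok" if not missing else "missing " + ", ".join(missing)
        print(f"{backend}: {status}")
```

Passing an explicit mapping as env makes the check easy to test without touching the real environment.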

Highlighted Details

  • Offers multiple model sizes: MolmoWeb-8B and MolmoWeb-4B, with both HuggingFace/transformers-compatible and native checkpoints available.
  • Provides a Python inference client for end-to-end task execution, supporting single queries, batch queries, and continuous follow-up queries.
  • Supports various inference backends including FastAPI, Modal, and native/HF checkpoints, with vLLM support planned.
  • Includes functionality to extract accessibility trees from web pages.
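
To give a sense of what an extracted accessibility tree might look like once it is turned into text for a model, here is a minimal sketch. It assumes each node is a dictionary with "role", "name", and "children" keys, a shape loosely modeled on browser accessibility snapshots; MolmoWeb's actual tree format and extraction code may differ, and flatten_ax_tree is a hypothetical helper.

```python
from typing import Any, Dict, List


def flatten_ax_tree(node: Dict[str, Any], depth: int = 0) -> List[str]:
    """Render a nested accessibility-tree dict as indented text lines.

    Assumed node shape (hypothetical): {"role": str, "name": str,
    "children": [...]}, where "name" and "children" are optional.
    """
    role = node.get("role", "unknown")
    name = node.get("name", "")
    line = "  " * depth + role + (f' "{name}"' if name else "")
    lines = [line]
    for child in node.get("children", []):
        lines.extend(flatten_ax_tree(child, depth + 1))
    return lines


sample_tree = {
    "role": "WebArea",
    "name": "Example",
    "children": [
        {"role": "heading", "name": "Welcome"},
        {"role": "button", "name": "Sign in"},
        {"role": "textbox"},
    ],
}

flat = flatten_ax_tree(sample_tree)
print("\n".join(flat))
```

An indented text form like this keeps the page hierarchy visible while staying compact enough to include alongside a screenshot in a prompt.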

Maintenance & Community

No community channels (e.g., Discord, Slack), notable contributors, or sponsorships are listed for the project.

Licensing & Compatibility

The project is licensed under the Apache 2.0 license, which is permissive for commercial use and integration into closed-source projects.

Limitations & Caveats

The TODO list indicates that evaluation (Eval) and training (Training) functionalities are not yet fully implemented. Specific backend configurations require external API keys, which may present an adoption hurdle.

Health Check

  • Last Commit: 1 day ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 11
  • Issues (30d): 5
  • Star History: 488 stars in the last 18 days

Explore Similar Projects

Starred by Will Brown (Research Lead at Prime Intellect), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 16 more.

stagehand by browserbase

0.8%
22k
AI browser automation framework for production
Created 2 years ago
Updated 1 day ago