molmoweb by allenai

Autonomous multimodal web agent SDK

Created 2 months ago

563 stars

Top 56.5% on SourcePulse

Project Summary

MolmoWeb is an open multimodal web agent for autonomous web navigation and task completion. It lets researchers and developers automate browser interactions such as clicking, typing, and scrolling from natural language prompts. The project provides the agent code, an inference client, and evaluation benchmarks, with the aim of making results on automated web tasks reproducible.

How It Works

MolmoWeb uses a multimodal large language model that interprets a natural language task together with visual and structural information from the web page (screenshots and accessibility trees). It autonomously generates a sequence of browser actions (clicking elements, typing text, scrolling, and navigating to URLs) to fulfill user-defined objectives. This lets it carry out multi-step tasks directly in a web browser.
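The observe-then-act cycle described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `Observation`, `Action`, `FakeBrowser`, `run_agent`, and `scripted_policy` are hypothetical names that do not come from the MolmoWeb code base, and the scripted policy stands in for the real model.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Observation:
    """What the model sees at each step (hypothetical structure)."""
    url: str
    screenshot_png: bytes      # raw screenshot bytes
    accessibility_tree: str    # text rendering of the page structure


@dataclass
class Action:
    """One browser action: click, type, scroll, goto, or done."""
    kind: str
    target: Optional[str] = None  # element reference or URL
    text: Optional[str] = None    # text to type, if any


class FakeBrowser:
    """Stand-in for a real browser so the loop runs without Playwright."""

    def __init__(self) -> None:
        self.url = "about:blank"
        self.log: List[str] = []

    def observe(self) -> Observation:
        return Observation(self.url, b"", f"page at {self.url}")

    def apply(self, action: Action) -> None:
        if action.kind == "goto" and action.target:
            self.url = action.target
        self.log.append(action.kind)


Policy = Callable[[str, Observation, List[Action]], Action]


def run_agent(task: str, policy: Policy, browser: FakeBrowser,
              max_steps: int = 10) -> List[Action]:
    """Observe the page, ask the policy for an action, apply it, repeat."""
    history: List[Action] = []
    for _ in range(max_steps):
        action = policy(task, browser.observe(), history)
        history.append(action)
        if action.kind == "done":
            break
        browser.apply(action)
    return history


def scripted_policy(task: str, obs: Observation,
                    history: List[Action]) -> Action:
    """Placeholder for the multimodal model: replays a fixed script."""
    script = [
        Action("goto", target="https://example.com"),
        Action("click", target="search box"),
        Action("type", target="search box", text=task),
        Action("done"),
    ]
    return script[min(len(history), len(script) - 1)]
```

In a real setup the policy would call the model with the screenshot and accessibility tree, and the browser would be driven through Playwright or a hosted backend. The step cap (`max_steps`) is a common safeguard against loops that never emit a final action.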

Quick Start & Requirements

Installation: Requires Python 3.10+. Uses uv for dependency management. Clone the repository, create a virtual environment with uv venv, and sync dependencies with uv sync. Playwright browsers must be installed via uv run playwright install --with-deps chromium.
Prerequisites: Environment variables for API keys are required for specific backends: BROWSERBASE_API_KEY and BROWSERBASE_PROJECT_ID for Browserbase; GOOGLE_API_KEY for Google Gemini; OPENAI_API_KEY for GPT-based judges.
Model Download & Server: Use scripts/download_weights.sh to fetch models (e.g., allenai/MolmoWeb-8B) and scripts/start_server.sh to launch a local inference server.
Testing: Execute scripts/test_server.py to test the running model server.
Documentation: Links to Paper, Blog Post, Demo, Models, and Data are provided.
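Since each backend needs its own API keys, a small pre-flight check can catch a missing variable before a run starts. The sketch below uses the environment variable names listed above. The backend labels (`browserbase`, `gemini`, `gpt_judge`) and the `missing_env` helper are hypothetical and not part of the project.

```python
import os
from typing import Dict, List, Mapping, Optional

# Variable names come from the prerequisites above; the dictionary
# keys are illustrative labels, not identifiers used by MolmoWeb.
REQUIRED_ENV: Dict[str, List[str]] = {
    "browserbase": ["BROWSERBASE_API_KEY", "BROWSERBASE_PROJECT_ID"],
    "gemini": ["GOOGLE_API_KEY"],
    "gpt_judge": ["OPENAI_API_KEY"],
}


def missing_env(backend: str,
                env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_ENV[backend] if not env.get(name)]


if __name__ == "__main__":
    for backend in REQUIRED_ENV:
        missing = missing_env(backend)
        status = "ok" if not missing else "missing " + ", ".join(missing)
        print(f"{backend}: {status}")
```

Passing an explicit mapping instead of reading `os.environ` directly keeps the helper easy to test.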

Highlighted Details

Offers multiple model sizes: MolmoWeb-8B and MolmoWeb-4B, with both HuggingFace/transformers-compatible and native checkpoints available.
Provides a Python inference client for end-to-end task execution, supporting single queries, batch queries, and continuous follow-up queries.
Supports various inference backends including FastAPI, Modal, and native/HF checkpoints, with vLLM support planned.
Includes functionality to extract accessibility trees from web pages.
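To illustrate what an extracted accessibility tree might look like once it is passed to a model as text, here is a minimal sketch that flattens a nested tree into indented lines. The node shape (a dictionary with `role`, `name`, and `children`) and the `flatten_tree` function are assumptions made for this example. MolmoWeb's actual representation may differ.

```python
from typing import Any, Dict, List


def flatten_tree(node: Dict[str, Any], depth: int = 0) -> List[str]:
    """Render a nested accessibility node as indented text lines."""
    line = f"{'  ' * depth}{node.get('role', 'unknown')}"
    name = node.get("name")
    if name:
        line += f' "{name}"'
    lines = [line]
    # Recurse into children, indenting one level per depth step.
    for child in node.get("children", []):
        lines.extend(flatten_tree(child, depth + 1))
    return lines


if __name__ == "__main__":
    tree = {
        "role": "WebArea",
        "name": "Home",
        "children": [
            {"role": "button", "name": "Search"},
            {"role": "textbox"},
        ],
    }
    print("\n".join(flatten_tree(tree)))
```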

Maintenance & Community

No community channels (e.g., Discord, Slack), notable contributors, or sponsorships are mentioned in the project's documentation.

Licensing & Compatibility

The project is licensed under the Apache 2.0 license, which is permissive for commercial use and integration into closed-source projects.

Limitations & Caveats

The project's TODO list indicates that evaluation (Eval) and training (Training) functionality is not yet fully implemented. Some backends require external API keys (Browserbase, Google Gemini, and OpenAI for GPT-based judges), which may be an adoption hurdle.
