webclaw by 0xMassi

Fast, local web content extraction for AI agents

Created 4 weeks ago

461 stars

Top 65.6% on SourcePulse

Project Summary

Fast, local-first web content extraction for LLMs. webclaw is a high-performance tool built in Rust for AI agents and developers. It targets three common problems with web scraping: it is slow, it wastes tokens, and it is often blocked. webclaw answers these with millisecond-scale extraction (about 3.2 ms for a 100 KB page), output that uses up to 67% fewer tokens than raw HTML, and TLS fingerprinting that gets past common bot protections without a headless browser.

How It Works

Built on Rust and TLS fingerprinting, webclaw avoids the overhead of running a headless browser such as Chrome under an automation tool like Puppeteer. It fetches and parses web content directly, stripping non-essential elements such as navigation, ads, and footers to produce clean, structured output. This output is optimized for LLMs: it reduces token count by up to 67% compared to raw HTML while preserving essential metadata, links, and images.
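To make the noise-stripping idea concrete, here is a minimal Python sketch. It is not webclaw's algorithm (the source does not describe it, and webclaw itself is written in Rust); it simply drops everything inside a hand-picked set of tags and keeps the remaining text. The tag set, the helper names, and the sample page are all assumptions made for illustration, and character count is used only as a rough stand-in for token count.

```python
from html.parser import HTMLParser

# Tags whose whole subtree is treated as noise. This set is an assumption
# for the sketch, not the rule set webclaw actually uses.
NOISE_TAGS = {"nav", "footer", "aside", "script", "style"}


class MainTextExtractor(HTMLParser):
    """Collects text that is not nested inside any noise tag."""

    def __init__(self):
        super().__init__()
        self.noise_depth = 0  # how many noise tags we are currently inside
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self.noise_depth += 1

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self.noise_depth > 0:
            self.noise_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.noise_depth == 0:
            self.chunks.append(text)


def extract_main_text(html: str) -> str:
    """Return the page text with noise subtrees removed, one chunk per line."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


def reduction_percent(raw: str, cleaned: str) -> float:
    """Size reduction in percent, measured in characters (a token proxy)."""
    return 100.0 * (1 - len(cleaned) / len(raw))


# Hypothetical sample page used only to exercise the sketch.
SAMPLE = """<html><body>
<nav><a href="/">Home</a> <a href="/about">About</a></nav>
<article><h1>Release notes</h1><p>Version 2 adds caching.</p></article>
<footer>Copyright Example Corp</footer>
</body></html>"""

if __name__ == "__main__":
    cleaned = extract_main_text(SAMPLE)
    print(cleaned)
    print(f"{reduction_percent(SAMPLE, cleaned):.1f}% smaller than the raw HTML")
```

A real extractor also has to keep links, images, and metadata and to score content blocks rather than rely on tag names alone, so the reduction measured on this toy page says nothing about the 67% figure quoted above.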

Quick Start & Requirements

Install with npx create-webclaw (which also sets up AI tool integration automatically), Homebrew (brew install webclaw), prebuilt binaries, Cargo, or Docker. Local LLM features (summarization, structured extraction) require a running Ollama instance. The optional cloud API, used for advanced features such as bot bypass and JavaScript rendering, requires a WEBCLAW_API_KEY.
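Before relying on the optional features, it can help to check that their prerequisites are in place. The Python sketch below is a hypothetical preflight helper, not part of webclaw: it reports whether WEBCLAW_API_KEY is set and whether an Ollama server answers on Ollama's default address, http://localhost:11434. The function names and feature labels are made up for illustration, and the source does not say how webclaw itself locates Ollama.

```python
import http.client
import os
import urllib.request

# Ollama's default listen address; adjust if your instance runs elsewhere.
DEFAULT_OLLAMA_URL = "http://localhost:11434"


def has_cloud_key(env=None) -> bool:
    """True if WEBCLAW_API_KEY is set to a non-empty value."""
    env = os.environ if env is None else env
    return bool(env.get("WEBCLAW_API_KEY", "").strip())


def ollama_reachable(base_url: str = DEFAULT_OLLAMA_URL, timeout: float = 1.0) -> bool:
    """True if an HTTP server answers with status 200 at base_url."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, DNS failure, or a malformed reply.
        return False


def preflight(env=None, ollama_url: str = DEFAULT_OLLAMA_URL) -> dict:
    """Summarize which optional feature groups look usable.

    The keys are illustrative labels, not webclaw option names.
    """
    return {
        "cloud_features": has_cloud_key(env),
        "local_llm_features": ollama_reachable(ollama_url),
    }


if __name__ == "__main__":
    for feature, ready in preflight().items():
        print(f"{feature}: {'ready' if ready else 'not configured'}")
```

Core extraction needs neither prerequisite, so a result of "not configured" for both entries only means the optional features are unavailable.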

Highlighted Details

  • Achieves millisecond-scale extraction (e.g., 3.2 ms for a 100 KB page) with zero browser overhead.
  • Optimized LLM output reduces token usage by up to 67% compared to raw HTML.
  • Features an MCP server for seamless integration with AI agents like Claude and Cursor.
  • Boasts high extraction accuracy (95.1%) and noise removal (96.1%).
  • Runs 8 of its 10 tools locally and privately, with no API key required.

Maintenance & Community

The project maintains an active community via Discord for questions and feedback, and encourages contributions through GitHub Issues and a dedicated CONTRIBUTING.md file.

Licensing & Compatibility

Distributed under the permissive MIT License, webclaw allows for unrestricted use, modification, and distribution, including within commercial and closed-source applications.

Limitations & Caveats

Certain advanced features, such as bypassing sophisticated bot protections, rendering JavaScript-heavy pages, and using the search and research tools, require opting into the optional hosted webclaw.io cloud API. Local LLM features depend on a correctly configured Ollama or similar service.

Health Check

  • Last commit: 4 days ago
  • Responsiveness: Inactive
  • Pull requests (30d): 4
  • Issues (30d): 8
  • Star history: 464 stars in the last 29 days
