reader by vakra-dev

Production-grade web scraping engine for LLMs

Created 1 month ago
445 stars

Top 67.4% on SourcePulse

Project Summary

Reader: Production-Grade Web Scraping for LLM Agents

Reader is an open-source, production-grade web scraping engine designed to simplify building LLM agents that need web access. It addresses the common frustrations of unreliable scraping by handling complex anti-bot measures, browser management, and data cleaning, and delivers clean markdown output ready for AI consumption.

How It Works

Reader is built on Ulixee Hero, a headless browser engineered specifically for sophisticated web scraping. It abstracts away the complexities of browser architecture, anti-bot bypass techniques (Cloudflare challenges, TLS fingerprinting), proxy infrastructure, and resource management. For data transformation, Reader uses supermarkdown, a Rust-based converter optimized for real-world, often malformed HTML, producing clean, LLM-friendly markdown. This dual focus on robust scraping and clean output differentiates it from simpler HTML-to-markdown tools.
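To make the "clean markdown" idea concrete, here is a deliberately minimal toy sketch of the transformation: drop common page chrome, then convert the remaining tags. This is not Reader's actual pipeline (that is supermarkdown, written in Rust, and far more robust against malformed HTML); it only shows the shape of the HTML-to-markdown step.

```typescript
// Toy illustration of boilerplate removal + markdown conversion.
// NOT the supermarkdown implementation -- just the concept.
function toCleanMarkdown(html: string): string {
  // Drop common page chrome: header, footer, and nav blocks.
  const main = html.replace(/<(header|footer|nav)[\s\S]*?<\/\1>/gi, "");
  return main
    .replace(/<h1[^>]*>([\s\S]*?)<\/h1>/gi, "# $1\n") // headings -> markdown
    .replace(/<p[^>]*>([\s\S]*?)<\/p>/gi, "$1\n")     // unwrap paragraphs
    .replace(/<[^>]+>/g, "")                           // strip leftover tags
    .trim();
}
```

A real converter additionally has to survive unclosed tags, nested layouts, and cookie banners that are not marked up semantically, which is why regex-based approaches like this one break down in practice.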

Quick Start & Requirements

  • Installation: npm install @vakra-dev/reader
  • Prerequisites: Node.js version 18 or higher. For headless Linux servers (e.g., VPS, EC2), system dependencies for Chromium must be installed (e.g., libnspr4, libnss3, libatk1.0-0, etc.).
  • Documentation: Full documentation is available at docs.reader.dev.
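On Debian/Ubuntu-based servers, the dependencies named above can be installed with apt. Note this is only the partial list given here; the full set of Chromium dependencies is in the project documentation.

```shell
# Partial list -- only the packages named above.
# See docs.reader.dev for the complete set required by headless Chromium.
sudo apt-get update
sudo apt-get install -y libnspr4 libnss3 libatk1.0-0
```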

Highlighted Details

  • Advanced Anti-Bot Bypass: Features Cloudflare bypass through TLS fingerprinting, DNS over TLS, and WebRTC masking.
  • Clean Output: Provides markdown and HTML with automatic main content extraction, removing common page artifacts like headers, footers, and cookie banners.
  • Flexible Usage: Offers both a Command Line Interface (CLI) and a programmatic API.
  • Browser Pooling: Manages a pool of browser instances with auto-recycling, health monitoring, and request queuing for efficient resource utilization.
  • Website Crawling: Includes functionality for Breadth-First Search (BFS) link discovery with configurable depth and page limits.
  • Proxy Support: Integrates datacenter and residential proxies with rotation strategies.
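The crawling bullet above describes BFS link discovery bounded by depth and page count. A self-contained sketch of that traversal pattern (not Reader's implementation; `fetchLinks` is a stand-in for real page fetching and conversion):

```typescript
// BFS crawl sketch with configurable depth and page limits.
// `fetchLinks` is a hypothetical stand-in that returns the links on a page.
type FetchLinks = (url: string) => string[];

function crawlBfs(
  start: string,
  fetchLinks: FetchLinks,
  maxDepth: number,
  maxPages: number
): string[] {
  const visited = new Set<string>([start]);
  const queue: Array<{ url: string; depth: number }> = [{ url: start, depth: 0 }];
  const order: string[] = [];

  while (queue.length > 0 && order.length < maxPages) {
    const { url, depth } = queue.shift()!;
    order.push(url); // a real crawler would fetch and convert the page here
    if (depth >= maxDepth) continue; // stop expanding past the depth limit
    for (const link of fetchLinks(url)) {
      if (!visited.has(link)) {
        visited.add(link); // dedupe so each page is visited once
        queue.push({ url: link, depth: depth + 1 });
      }
    }
  }
  return order;
}
```

BFS (rather than DFS) is the natural choice here because pages closest to the start URL are usually the most relevant, and a page budget cuts off the crawl at a predictable frontier.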

Maintenance & Community

The project is actively maintained, with Discord and GitHub Issues available for support and community interaction. The primary author is Nihal Kaul.

Licensing & Compatibility

Reader is licensed under the Apache 2.0 license. This license is permissive and generally compatible with commercial use and integration into closed-source projects.

Limitations & Caveats

Deploying on headless Linux servers requires careful installation of specific system dependencies for the underlying Chromium browser, similar to other headless browser tools like Puppeteer and Playwright. The effectiveness of anti-bot bypass mechanisms can vary as anti-scraping technologies evolve.

Health Check

  • Last Commit: 3 weeks ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 451 stars in the last 30 days
