reader by vakra-dev

Production-grade web scraping engine for LLMs

Created 1 month ago
445 stars

Top 67.4% on SourcePulse

Project Summary

Reader: Production-Grade Web Scraping for LLM Agents

Reader is an open-source, production-grade web scraping engine designed to simplify building LLM agents that need web access. It addresses the common frustrations of unreliable scraping by handling complex anti-bot measures, browser management, and data cleaning, and delivers clean markdown output ready for AI consumption.

How It Works

Reader is built on Ulixee Hero, a headless browser engineered specifically for sophisticated web scraping. It abstracts away the complexities of browser architecture, anti-bot bypass techniques (Cloudflare challenges, TLS fingerprinting), proxy infrastructure, and resource management. For data transformation, Reader uses supermarkdown, a Rust-based converter optimized for real-world, often malformed HTML, producing clean, LLM-friendly markdown. This dual focus on robust scraping and clean output differentiates it from simpler HTML-to-markdown tools.
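To make the "clean markdown" idea concrete, here is a deliberately minimal toy sketch of the transformation: drop common page chrome, then convert the remaining tags. This is not Reader's actual pipeline (that is supermarkdown, written in Rust, and far more robust against malformed HTML); it only shows the shape of the HTML-to-markdown step.

```typescript
// Toy illustration of boilerplate removal + markdown conversion.
// NOT the supermarkdown implementation -- just the concept.
function toCleanMarkdown(html: string): string {
  // Drop common page chrome: header, footer, and nav blocks.
  const main = html.replace(/<(header|footer|nav)[\s\S]*?<\/\1>/gi, "");
  return main
    .replace(/<h1[^>]*>([\s\S]*?)<\/h1>/gi, "# $1\n") // headings -> markdown
    .replace(/<p[^>]*>([\s\S]*?)<\/p>/gi, "$1\n")     // unwrap paragraphs
    .replace(/<[^>]+>/g, "")                           // strip leftover tags
    .trim();
}
```

A real converter additionally has to survive unclosed tags, nested layouts, and cookie banners that are not marked up semantically, which is why regex-based approaches like this one break down in practice.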

Quick Start & Requirements

  • Installation: npm install @vakra-dev/reader
  • Prerequisites: Node.js version 18 or higher. For headless Linux servers (e.g., VPS, EC2), system dependencies for Chromium must be installed (e.g., libnspr4, libnss3, libatk1.0-0, etc.).
  • Documentation: Full documentation is available at docs.reader.dev.
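On Debian/Ubuntu-based servers, the dependencies named above can be installed with apt. Note this is only the partial list given here; the full set of Chromium dependencies is in the project documentation.

```shell
# Partial list -- only the packages named above.
# See docs.reader.dev for the complete set required by headless Chromium.
sudo apt-get update
sudo apt-get install -y libnspr4 libnss3 libatk1.0-0
```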

Highlighted Details

  • Advanced Anti-Bot Bypass: Features Cloudflare bypass through TLS fingerprinting, DNS over TLS, and WebRTC masking.
  • Clean Output: Provides markdown and HTML with automatic main content extraction, removing common page artifacts like headers, footers, and cookie banners.
  • Flexible Usage: Offers both a Command Line Interface (CLI) and a programmatic API.
  • Browser Pooling: Manages a pool of browser instances with auto-recycling, health monitoring, and request queuing for efficient resource utilization.
  • Website Crawling: Includes functionality for Breadth-First Search (BFS) link discovery with configurable depth and page limits.
  • Proxy Support: Integrates datacenter and residential proxies with rotation strategies.
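The crawling bullet above describes BFS link discovery bounded by depth and page count. A self-contained sketch of that traversal pattern (not Reader's implementation; `fetchLinks` is a stand-in for real page fetching and conversion):

```typescript
// BFS crawl sketch with configurable depth and page limits.
// `fetchLinks` is a hypothetical stand-in that returns the links on a page.
type FetchLinks = (url: string) => string[];

function crawlBfs(
  start: string,
  fetchLinks: FetchLinks,
  maxDepth: number,
  maxPages: number
): string[] {
  const visited = new Set<string>([start]);
  const queue: Array<{ url: string; depth: number }> = [{ url: start, depth: 0 }];
  const order: string[] = [];

  while (queue.length > 0 && order.length < maxPages) {
    const { url, depth } = queue.shift()!;
    order.push(url); // a real crawler would fetch and convert the page here
    if (depth >= maxDepth) continue; // stop expanding past the depth limit
    for (const link of fetchLinks(url)) {
      if (!visited.has(link)) {
        visited.add(link); // dedupe so each page is visited once
        queue.push({ url: link, depth: depth + 1 });
      }
    }
  }
  return order;
}
```

BFS (rather than DFS) is the natural choice here because pages closest to the start URL are usually the most relevant, and a page budget cuts off the crawl at a predictable frontier.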

Maintenance & Community

The project is actively maintained, with Discord and GitHub Issues available for support and community interaction. The primary author is Nihal Kaul.

Licensing & Compatibility

Reader is licensed under the Apache 2.0 license. This license is permissive and generally compatible with commercial use and integration into closed-source projects.

Limitations & Caveats

Deploying on headless Linux servers requires careful installation of specific system dependencies for the underlying Chromium browser, similar to other headless browser tools like Puppeteer and Playwright. The effectiveness of anti-bot bypass mechanisms can vary as anti-scraping technologies evolve.

Health Check

  • Last Commit: 3 weeks ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 451 stars in the last 30 days
