teracrawl by BrowserCash

High-performance web crawler API for LLM data extraction

Created 8 months ago

272 stars

Top 94.6% on SourcePulse

Project Summary

Teracrawl is a high-performance, production-ready API designed to convert web content into clean, LLM-ready Markdown. It addresses the challenges of JavaScript rendering, anti-bot measures, and complex HTML structures, making real-time data accessible to AI systems. The tool is ideal for engineers and researchers building applications that require robust web scraping and data extraction for natural language processing tasks.

How It Works

Teracrawl runs on managed remote Chrome browsers, powered by Browser.cash, to keep success rates high even on protected websites. Its core approach is "Smart Two-Phase Crawling": a Fast Mode optimized for static/SSR pages, with an automatic fallback to a Dynamic Mode that waits for rendering on complex Single Page Applications (SPAs). It also offers a combined "Search + Scrape" endpoint that queries Google and scrapes the top results in parallel, converting raw HTML into semantic Markdown suitable for Retrieval-Augmented Generation (RAG) and LLM context windows.
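The two-phase idea can be illustrated with a small TypeScript sketch. This is not Teracrawl's actual implementation: the Fetcher type, the looksRendered heuristic, and the 200-character threshold are all assumptions made for illustration. It only shows the control flow of trying a cheap path first and falling back to a slower rendering path when the result looks empty or the fast attempt fails.

```typescript
// Result of one fetch attempt; `markdown` is the converted page content.
interface PageResult {
  markdown: string;
}

// Hypothetical fetcher signature: fast and dynamic modes share this shape.
type Fetcher = (url: string) => Promise<PageResult>;

// Hypothetical heuristic: very short output suggests the page still
// needs client-side rendering. The threshold is an arbitrary assumption.
function looksRendered(result: PageResult, minChars: number = 200): boolean {
  return result.markdown.trim().length >= minChars;
}

async function twoPhaseCrawl(
  url: string,
  fast: Fetcher,
  dynamic: Fetcher
): Promise<{ mode: "fast" | "dynamic"; result: PageResult }> {
  try {
    const quick = await fast(url);
    if (looksRendered(quick)) {
      return { mode: "fast", result: quick };
    }
  } catch {
    // Fast mode failed outright; fall through to dynamic mode.
  }
  // Fallback: the dynamic fetcher is expected to wait for rendering.
  const rendered = await dynamic(url);
  return { mode: "dynamic", result: rendered };
}
```

Passing the two fetchers in as parameters keeps the fallback logic independent of how either mode talks to the browser, which also makes it easy to exercise with stub functions.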

Quick Start & Requirements

Installation: Clone the repository (git clone https://github.com/BrowserCash/teracrawl.git), navigate into the directory (cd teracrawl), and install dependencies (npm install).
Prerequisites: Node.js 18+ and a Browser.cash API Key are required. A running instance of browser-serp on port 8080 is necessary for the /crawl endpoint.
Running: Use npm run dev for development or npm run build followed by npm start for production. The server typically runs at http://0.0.0.0:8085.
Docker: Docker support is provided via a Dockerfile and a Docker Compose example for easier deployment.
Links: GitHub Repository (https://github.com/BrowserCash/teracrawl)
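Once the server is running, a client could prepare a call to the /scrape endpoint along these lines. The summary does not document the request format, so the POST method, the JSON body, and the `url` field name are assumptions; check the repository README for the actual schema. The helper buildScrapeRequest is a hypothetical name used only in this sketch.

```typescript
// Hypothetical request description; the real payload shape is not
// documented in this summary.
interface ScrapeRequest {
  endpoint: string;
  method: "POST";
  headers: { "Content-Type": string };
  body: string;
}

function buildScrapeRequest(baseUrl: string, targetUrl: string): ScrapeRequest {
  // Reject anything that is not an http(s) URL before contacting the server.
  if (!/^https?:\/\//i.test(targetUrl)) {
    throw new Error(`Expected an http(s) URL, got: ${targetUrl}`);
  }
  return {
    // Strip trailing slashes so the path is joined exactly once.
    endpoint: `${baseUrl.replace(/\/+$/, "")}/scrape`,
    method: "POST",
    headers: { "Content-Type": "application/json" },
    // The `url` field name is an assumption, not a documented parameter.
    body: JSON.stringify({ url: targetUrl }),
  };
}

const req = buildScrapeRequest("http://localhost:8085", "https://example.com");
console.log(req.endpoint); // http://localhost:8085/scrape
```

On Node.js 18+, the resulting object can be passed to the built-in fetch as fetch(req.endpoint, { method: req.method, headers: req.headers, body: req.body }).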

Highlighted Details

Reports the #1 coverage score (84.2%) on the scrape-evals benchmark for web scrapers.
Provides a unified /crawl endpoint that runs a Google search and scrapes the top results, and a /scrape endpoint that converts a single URL directly.
Employs smart content extraction to identify main content areas and remove clutter, along with safety features like blocking ads/trackers and removing base64 images.
Built with a robust session pool for high concurrency and includes automatic timeout handling and error recovery.
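The session pool and timeout handling mentioned above can be pictured with the following sketch. It is an illustration under stated assumptions, not Teracrawl's code: the SessionPool class, its run method, and the error message are hypothetical. It caps how many tasks hold a session at once and rejects any task that exceeds its time budget.

```typescript
// A minimal fixed-size pool: at most `size` tasks hold a slot at once.
class SessionPool {
  private available: number;
  private waiters: Array<() => void> = [];

  constructor(size: number) {
    this.available = size;
  }

  private async acquire(): Promise<void> {
    if (this.available > 0) {
      this.available--;
      return;
    }
    // No free slot: wait until a finishing task hands its slot over.
    await new Promise<void>((resolve) => this.waiters.push(resolve));
  }

  private release(): void {
    const next = this.waiters.shift();
    if (next) {
      next(); // Hand the slot directly to the next waiter.
    } else {
      this.available++;
    }
  }

  // Runs `task` once a slot is free; rejects if it exceeds `timeoutMs`.
  async run<T>(task: () => Promise<T>, timeoutMs: number): Promise<T> {
    await this.acquire();
    let timer: ReturnType<typeof setTimeout> | undefined;
    const timeout = new Promise<never>((_, reject) => {
      timer = setTimeout(() => reject(new Error("session timed out")), timeoutMs);
    });
    try {
      return await Promise.race([task(), timeout]);
    } finally {
      clearTimeout(timer);
      this.release();
    }
  }
}
```

Note that in this sketch a timed-out task is not cancelled; its slot is simply released and its eventual result is ignored. A real crawler would also need to close or recycle the underlying browser session at that point.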

Maintenance & Community

The project welcomes contributions via pull requests. Specific community channels (like Discord/Slack) or notable maintainers/sponsors are not detailed in the README.

Licensing & Compatibility

This project is licensed under the MIT License, which is permissive for commercial use and integration into closed-source projects.

Limitations & Caveats

The search functionality (/crawl endpoint) is dependent on a separately running browser-serp service. Users must obtain and configure a Browser.cash API key.

Health Check

Last Commit

7 months ago

Responsiveness

Inactive

Star History

4 stars in the last 30 days