AnyCrawl by any4ai

Web crawler for LLM-ready data extraction

created 4 months ago
1,055 stars

Top 36.4% on sourcepulse

Project Summary

AnyCrawl is a high-performance Node.js/TypeScript web crawler and scraper that extracts structured data from websites and search engine results pages (SERPs). It is aimed at developers and researchers who process large volumes of web data, particularly for LLM applications, and offers multi-threaded crawling plus SERP extraction from multiple search engines.

How It Works

AnyCrawl uses a multi-threaded, multi-process architecture for high performance. It supports multiple scraping engines: cheerio for fast static HTML parsing, and playwright or puppeteer for JavaScript-rendered content. Users can pick the engine that fits each job, trading speed against rendering capability. The project describes its output as LLM-ready, which suggests structured formats suited to consumption by AI models.
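
As an illustration of the speed-versus-rendering trade-off, the sketch below picks an engine from a simple hint about the target page. The engine names come from the summary above; `PageHint`, `chooseEngine`, and the default of playwright for JavaScript-heavy pages are hypothetical choices made for this example, not part of AnyCrawl's API.

```typescript
// Engine names as listed in the summary above.
type Engine = "cheerio" | "playwright" | "puppeteer";

// Hypothetical description of a target page (not an AnyCrawl type).
interface PageHint {
  // True when the content only appears after client-side JavaScript runs.
  needsJavaScript: boolean;
  // Optional preference between the two browser-based engines.
  preferredBrowserEngine?: "playwright" | "puppeteer";
}

// Pick the cheapest engine that can still render the page.
function chooseEngine(hint: PageHint): Engine {
  if (!hint.needsJavaScript) {
    // Static HTML: a parser is enough, no browser needed.
    return "cheerio";
  }
  // Assumption for this sketch: default to playwright when no preference is given.
  return hint.preferredBrowserEngine ?? "playwright";
}

// Usage: a static page and a JavaScript-rendered page.
const staticChoice = chooseEngine({ needsJavaScript: false });
const dynamicChoice = chooseEngine({ needsJavaScript: true });
```

Here `staticChoice` is "cheerio" and `dynamicChoice` is "playwright"; a real decision would likely also weigh cost, concurrency limits, and anti-bot behavior.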

Quick Start & Requirements

  • Install/Run: docker compose up --build
  • Prerequisites: Docker. Environment variables can be configured for proxy, SSL error handling, database type (SQLite or PostgreSQL), and API authentication.
  • Documentation: https://docs.anycrawl.dev
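
To make the configuration surface concrete, here is a minimal sketch of reading the four settings named above from an environment map. The variable names (`PROXY_URL`, `IGNORE_SSL_ERRORS`, `DB_TYPE`, `API_AUTH_ENABLED`) and the SQLite default are placeholders invented for this example; the real names and defaults are in the project documentation.

```typescript
type DbType = "sqlite" | "postgresql";

// Hypothetical settings shape covering the options listed above.
interface CrawlerSettings {
  proxyUrl: string | null;
  ignoreSslErrors: boolean;
  dbType: DbType;
  apiAuthEnabled: boolean;
}

// Validate the database type; defaulting to SQLite is an assumption of this sketch.
function parseDbType(raw: string | undefined): DbType {
  const value = (raw ?? "sqlite").toLowerCase();
  if (value === "sqlite") return "sqlite";
  if (value === "postgresql") return "postgresql";
  throw new Error(`Unsupported DB_TYPE: ${value}`);
}

// Read settings from any string map. All variable names here are placeholders.
function readSettings(env: Record<string, string | undefined>): CrawlerSettings {
  return {
    proxyUrl: env["PROXY_URL"] ?? null,
    ignoreSslErrors: env["IGNORE_SSL_ERRORS"] === "true",
    dbType: parseDbType(env["DB_TYPE"]),
    apiAuthEnabled: env["API_AUTH_ENABLED"] === "true",
  };
}

// Usage with an explicit map; in Node.js, process.env could be passed instead.
const exampleSettings = readSettings({
  DB_TYPE: "postgresql",
  IGNORE_SSL_ERRORS: "true",
});
```

Taking the environment as a parameter rather than reading a global keeps the function easy to test and makes an unsupported database type fail early.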

Highlighted Details

  • Supports SERP crawling for Google, Bing, and Baidu with batch processing.
  • Offers efficient single-page, site-wide, and intelligent traversal crawling.
  • Provides API endpoints for direct scraping and SERP queries.
  • Configurable via environment variables, including proxy, SSL error handling, database type, and API authentication.
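
A call to a scraping endpoint might be assembled as in the sketch below. The path `/v1/scrape`, the `url` and `engine` body fields, the Bearer authorization scheme, and the local address in the usage line are all assumptions made for illustration; consult the documentation for the actual request format.

```typescript
// Plain description of an HTTP request, independent of any HTTP client.
interface HttpRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Build a scrape request. Endpoint path, field names, and auth scheme are assumed.
function buildScrapeRequest(
  baseUrl: string,
  targetUrl: string,
  engine: string,
  apiKey?: string,
): HttpRequest {
  const headers: Record<string, string> = { "Content-Type": "application/json" };
  if (apiKey) {
    // Only sent when API authentication is enabled on the server.
    headers["Authorization"] = `Bearer ${apiKey}`;
  }
  return {
    // Trim trailing slashes so the path is not doubled up.
    url: `${baseUrl.replace(/\/+$/, "")}/v1/scrape`,
    method: "POST",
    headers,
    body: JSON.stringify({ url: targetUrl, engine }),
  };
}

// Usage: the host and port are placeholders for a local Docker deployment.
const scrapeRequest = buildScrapeRequest(
  "http://localhost:8080/",
  "https://example.com",
  "cheerio",
);
```

The returned object can be handed to any HTTP client, for example `fetch(scrapeRequest.url, { method: scrapeRequest.method, headers: scrapeRequest.headers, body: scrapeRequest.body })`.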

Maintenance & Community

The project is developed by the Any4AI team, with a stated mission to build foundational products for the AI ecosystem. Contributions are welcomed.

Licensing & Compatibility

  • License: MIT License.
  • Compatibility: Permissive MIT license allows for commercial use and integration into closed-source projects.

Limitations & Caveats

The documentation centers on Docker deployment and says little about native installation or about using AnyCrawl directly as a Node.js module. Although the README lists multiple search engines, it does not detail support or specific configuration for Bing or Baidu.

Health Check

  • Last commit: 2 weeks ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 2
  • Issues (30d): 7

Star History

  • 1,399 stars in the last 90 days

Explore Similar Projects

Starred by Tobi Lutke (Cofounder of Shopify), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 7 more.

firecrawl by mendableai

1.9%
44k
API service for turning websites into LLM-ready data
created 1 year ago
updated 1 day ago