crawl4ai by unclecode

Open-source web crawler/scraper for LLMs, AI agents, and data pipelines

Created 1 year ago

58,388 stars

Top 0.4% on SourcePulse

View on GitHub

13 Experts Love This Project

Boris Cherny

Creator of Claude Code; MTS at Anthropic

Tobi Lutke

Cofounder of Shopify

Amanpreet Singh

Cofounder of Contextual AI

Casper Hansen

Author of AutoAWQ

and 9 more!

Project Summary

Crawl4AI is an open-source, high-performance web crawler and scraper designed for LLM-friendly data extraction. It targets developers and AI practitioners needing to efficiently gather and process web content for applications like RAG, fine-tuning, and AI agents. The project offers significant speed advantages and flexible deployment options.

How It Works

Crawl4AI leverages Playwright for browser automation, enabling control over headless or headed browser instances. It supports various extraction strategies, including heuristic-based Markdown generation optimized for LLMs, CSS/XPath-based structured data extraction, and LLM-driven extraction using providers like OpenAI. Advanced features include dynamic content handling, session management, proxy support, and network/console traffic capture.

Quick Start & Requirements

Install via pip: pip install -U crawl4ai
Manual Playwright installation if needed: python -m playwright install --with-deps chromium
Python 3.x
Documentation: docs.crawl4ai.com

Highlighted Details

Generates AI-optimized Markdown with heuristic filtering and BM25 for relevance.
Supports LLM-driven and CSS/XPath-based structured data extraction.
Offers flexible browser control, including persistent profiles and proxy support.
Provides a CLI tool (crwl) for direct command-line operations.
Features a Docker image for streamlined deployment with an interactive playground.

Maintenance & Community

Actively maintained with frequent releases (e.g., v0.6.0).
Discord community available: discord.gg/jP8KfhDhyN
Roadmap and contribution guidelines are accessible.

Licensing & Compatibility

Licensed under Apache License 2.0 with a mandatory attribution clause.
Attribution can be via badge or text in documentation.
Compatible with commercial and closed-source applications, provided attribution is met.

Limitations & Caveats

The project is rapidly evolving, with features like "Graph Crawler" and "Question-Based Crawler" still in the development roadmap. While the core functionality is robust, users should monitor release notes for potential breaking changes or new experimental features.

Health Check

Last Commit

1 week ago

Responsiveness

1 day

Pull Requests (30d)

Issues (30d)

Star History

1,425 stars in the last 30 days