crawl4ai by unclecode

Open-source web crawler/scraper for LLMs, AI agents, and data pipelines

created 1 year ago
50,034 stars

Top 0.5% on sourcepulse

Project Summary

Crawl4AI is an open-source, high-performance web crawler and scraper designed for LLM-friendly data extraction. It targets developers and AI practitioners who need to gather and process web content efficiently for applications such as RAG, fine-tuning, and AI agents. The project emphasizes crawl speed and supports several deployment options, including a pip package, a CLI, and a Docker image.

How It Works

Crawl4AI leverages Playwright for browser automation, enabling control over headless or headed browser instances. It supports various extraction strategies, including heuristic-based Markdown generation optimized for LLMs, CSS/XPath-based structured data extraction, and LLM-driven extraction using providers like OpenAI. Advanced features include dynamic content handling, session management, proxy support, and network/console traffic capture.
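As a rough illustration of the CSS-based structured extraction mentioned above, the sketch below pairs a selector schema with a crawl. The schema contents (`article.card`, `h2`, `a`) and the helper name `extract_articles` are hypothetical placeholders, and the class names `CrawlerRunConfig` and `JsonCssExtractionStrategy` reflect the project's documentation as I understand it; check them against the installed version before relying on them.

```python
import asyncio
import json

# Hypothetical schema: the selectors are placeholders for a page of article cards.
ARTICLE_SCHEMA = {
    "name": "articles",
    "baseSelector": "article.card",  # one extracted record per matching element
    "fields": [
        {"name": "title", "selector": "h2", "type": "text"},
        {"name": "link", "selector": "a", "type": "attribute", "attribute": "href"},
    ],
}


async def extract_articles(url: str) -> list:
    """Crawl `url` and return the records matched by ARTICLE_SCHEMA."""
    # Imported lazily so the schema can be inspected without crawl4ai installed.
    from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
    from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

    config = CrawlerRunConfig(
        extraction_strategy=JsonCssExtractionStrategy(ARTICLE_SCHEMA)
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url, config=config)
        # extracted_content is a JSON string; treat a missing value as no records.
        return json.loads(result.extracted_content or "[]")


if __name__ == "__main__":
    # https://example.com is a placeholder URL, not a page known to match the schema.
    print(asyncio.run(extract_articles("https://example.com")))
```

Keeping the schema as plain data means it can be versioned and reviewed separately from the crawl logic.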

Quick Start & Requirements

  • Install via pip: pip install -U crawl4ai
  • Manual Playwright installation if needed: python -m playwright install --with-deps chromium
  • Python 3.x
  • Documentation: docs.crawl4ai.com
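
Once the package is installed, a first crawl might look like the minimal sketch below. It assumes the `AsyncWebCrawler` interface described in the project's documentation; the helper name `fetch_markdown` and the URL are placeholders chosen for illustration.

```python
import asyncio


async def fetch_markdown(url: str) -> str:
    """Crawl one page and return its Markdown rendering as a string."""
    # Imported lazily so this module loads even where crawl4ai is not installed.
    from crawl4ai import AsyncWebCrawler

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url)
        if not result.success:
            raise RuntimeError(f"crawl failed: {result.error_message}")
        return str(result.markdown)


if __name__ == "__main__":
    # https://example.com is a placeholder URL.
    markdown = asyncio.run(fetch_markdown("https://example.com"))
    print(markdown[:500])  # show only the first 500 characters
```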

Highlighted Details

  • Generates AI-optimized Markdown with heuristic filtering and BM25 for relevance.
  • Supports LLM-driven and CSS/XPath-based structured data extraction.
  • Offers flexible browser control, including persistent profiles and proxy support.
  • Provides a CLI tool (crwl) for direct command-line operations.
  • Features a Docker image for streamlined deployment with an interactive playground.
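
To make the BM25 relevance idea concrete, here is a small standard-library sketch that scores text chunks against a query using the common Okapi BM25 formula. It is not Crawl4AI's implementation; the function names, the tokenizer, and the defaults `k1=1.5` and `b=0.75` are assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query, chunks, k1=1.5, b=0.75):
    """Return one Okapi BM25 score per chunk for the given query."""
    docs = [tokenize(c) for c in chunks]
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n if n else 0.0
    # Document frequency: the number of chunks that contain each term.
    df = Counter(term for d in docs for term in set(d))
    query_terms = set(tokenize(query))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization penalizes chunks longer than average.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


if __name__ == "__main__":
    chunks = [
        "Playwright drives a headless browser",
        "Cookie banner accept all cookies",
        "Browser sessions and proxy support",
    ]
    for chunk, score in zip(chunks, bm25_scores("headless browser", chunks)):
        print(f"{score:.3f}  {chunk}")
```

A filter built on scores like these would keep only chunks above some threshold, which is how page boilerplate with no query overlap gets dropped.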

Maintenance & Community

  • Actively maintained with frequent releases (e.g., v0.6.0).
  • Discord community available: discord.gg/jP8KfhDhyN
  • Roadmap and contribution guidelines are accessible.

Licensing & Compatibility

  • Licensed under Apache License 2.0 with a mandatory attribution clause.
  • Attribution can be via badge or text in documentation.
  • Compatible with commercial and closed-source applications, provided attribution is met.

Limitations & Caveats

The project is evolving rapidly, and features such as "Graph Crawler" and "Question-Based Crawler" are still on the development roadmap. The core functionality is robust, but users should monitor release notes for breaking changes and new experimental features.

Health Check

  • Last commit: 2 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 21
  • Issues (30d): 70
  • Star History: 8,370 stars in the last 90 days

Explore Similar Projects

Starred by Tobi Lutke (Cofounder of Shopify), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 7 more.

firecrawl by mendableai

1.9%
44k
API service for turning websites into LLM-ready data
created 1 year ago
updated 1 day ago