crawlee by apify

Web scraping/browser automation library for building reliable crawlers

Created 9 years ago

20,992 stars

Top 2.1% on SourcePulse

4 Experts Love This Project

Ross Wightman

Author of timm; CV at Hugging Face

Project Summary

Crawlee is a Node.js library for web scraping and browser automation, designed for building reliable and efficient crawlers. It targets developers who need to extract data from websites for AI, LLM, RAG, or GPT applications, and it supports various data formats and browser automation tools.

How It Works

Crawlee provides a unified interface for both HTTP and headless browser crawling, abstracting away the complexities of tools like Playwright and Puppeteer. It features a persistent queue for managing URLs, pluggable storage for scraped data, and built-in proxy rotation and session management. Together, these features are meant to help crawlers appear more human-like, get past common bot protections, and scale automatically with available system resources.
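To make the queue and proxy rotation ideas concrete, here is a minimal in-memory sketch in TypeScript. It is not Crawlee's implementation or API: the class names `DedupQueue` and `ProxyRotator`, their methods, and the proxy strings are hypothetical, and a real persistent queue would also write its state to disk or another store.

```typescript
// Hypothetical sketch of the concepts only; not Crawlee's code or API.

/** A FIFO URL queue that drops URLs it has already seen. */
class DedupQueue {
    private readonly seen = new Set<string>();
    private readonly pending: string[] = [];

    /** Returns true if the URL was new and has been enqueued. */
    add(url: string): boolean {
        // Strip the fragment so "page" and "page#top" count as one URL.
        const key = url.split('#')[0];
        if (this.seen.has(key)) return false;
        this.seen.add(key);
        this.pending.push(key);
        return true;
    }

    /** Removes and returns the oldest pending URL, if any. */
    next(): string | undefined {
        return this.pending.shift();
    }

    get size(): number {
        return this.pending.length;
    }
}

/** Hands out proxies round-robin and retires ones that get blocked. */
class ProxyRotator {
    private proxies: string[];
    private index = 0;

    constructor(proxies: string[]) {
        this.proxies = [...proxies];
    }

    next(): string {
        if (this.proxies.length === 0) throw new Error('no proxies left');
        const proxy = this.proxies[this.index % this.proxies.length];
        this.index += 1;
        return proxy;
    }

    /** Stops handing out a proxy, for example after it was blocked. */
    retire(proxy: string): void {
        this.proxies = this.proxies.filter((p) => p !== proxy);
    }
}
```

A crawler loop built on these two pieces would repeatedly call `next()` on the queue, fetch the URL through `ProxyRotator.next()`, and `add()` any newly discovered links; duplicates are ignored, so the crawl terminates once no new URLs appear.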

Quick Start & Requirements

Install via npm: npm install crawlee playwright
Requires Node.js 16 or higher.
Full documentation: https://crawlee.dev/docs/introduction
CLI quick start: npx crawlee create my-crawler

Highlighted Details

Supports Playwright, Puppeteer, Cheerio, JSDOM, and raw HTTP.
Offers zero-config HTTP2, TLS fingerprint replication, and automatic browser management.
Features customizable lifecycles with hooks and configurable routing.
Includes ready-to-deploy Dockerfiles.
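The hooks and routing idea mentioned in the list above can be sketched as follows. This is an illustration only, assuming a label-based dispatch model: `LabelRouter`, `CrawlContext`, `addPreHook`, and `dispatch` are hypothetical names and should not be read as Crawlee's actual router API.

```typescript
// Hypothetical sketch of label-based routing with lifecycle hooks.
// Names are illustrative and are not taken from Crawlee.

interface CrawlContext {
    url: string;
    label?: string;
    log: string[];
}

type Handler = (ctx: CrawlContext) => Promise<void> | void;

class LabelRouter {
    private readonly handlers = new Map<string, Handler>();
    private readonly preHooks: Handler[] = [];
    private defaultHandler: Handler = () => {};

    /** Registers a handler for requests carrying the given label. */
    addHandler(label: string, handler: Handler): void {
        this.handlers.set(label, handler);
    }

    /** Registers the fallback used when no label matches. */
    addDefaultHandler(handler: Handler): void {
        this.defaultHandler = handler;
    }

    /** Registers a hook that runs before every handler, in insertion order. */
    addPreHook(hook: Handler): void {
        this.preHooks.push(hook);
    }

    /** Runs all pre-hooks, then the handler matching the request label. */
    async dispatch(ctx: CrawlContext): Promise<void> {
        for (const hook of this.preHooks) {
            await hook(ctx);
        }
        const matched = ctx.label === undefined ? undefined : this.handlers.get(ctx.label);
        await (matched ?? this.defaultHandler)(ctx);
    }
}
```

For example, a crawler could register a `'DETAIL'` handler for product pages and a default handler for listing pages, with a pre-hook that sets headers or records timing before either one runs.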

Maintenance & Community

Developed by Apify.
Support channels: GitHub Issues, Stack Overflow, GitHub Discussions, Discord server.
Contribution guidelines available in CONTRIBUTING.md.

Licensing & Compatibility

Licensed under the Apache License 2.0.
Compatible with commercial use and closed-source projects.

Limitations & Caveats

Crawlee for Python is available for early adopters, but it is not the primary focus of this repository, which centers on the Node.js library. The README also mentions pre-release versions and notes that dependency overrides may be needed when using the Apify SDK.

Health Check

Last Commit

1 day ago

Responsiveness

1 day

Star History

227 stars in the last 30 days