crawlee-python by apify

Python library for web scraping and browser automation

Created 2 years ago

7,369 stars

Top 6.9% on SourcePulse

1 Expert Loves This Project

Rodrigo Nader

Cofounder of Langflow

Project Summary

Crawlee for Python is a comprehensive library for building reliable web scrapers and automating browser interactions. It targets developers needing to extract data for AI, LLMs, or RAG applications, offering a unified interface for both raw HTTP requests and headless browser automation, with built-in proxy rotation and robust error handling.

How It Works

Crawlee provides two primary crawler types: BeautifulSoupCrawler for efficient HTML parsing via HTTP requests, and PlaywrightCrawler for JavaScript-heavy sites using headless browsers. This dual approach allows users to select the most performant method for their specific needs. Its asynchronous, asyncio-based architecture and extensive configuration options enable fine-grained control over crawling behavior, retries, and data storage.
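The asyncio-based design with retries described above can be illustrated with a standard-library-only sketch. This is not Crawlee's implementation or API: `fetch_with_retries`, `crawl`, and `make_flaky_fetcher` are hypothetical names, and the fetcher simulates failures instead of making real HTTP requests. It only shows the general pattern of bounded concurrency plus per-request retries.

```python
import asyncio
from typing import Awaitable, Callable

# A fetcher is any coroutine function mapping a URL to a response body.
Fetcher = Callable[[str], Awaitable[str]]


async def fetch_with_retries(fetch: Fetcher, url: str, max_retries: int = 3) -> str:
    """Call fetch(url), retrying on ConnectionError up to max_retries extra times."""
    for attempt in range(max_retries + 1):
        try:
            return await fetch(url)
        except ConnectionError:
            if attempt == max_retries:
                raise  # retries exhausted, surface the error to the caller
            # Tiny exponential backoff; a real crawler would wait longer.
            await asyncio.sleep(0.01 * 2**attempt)
    raise RuntimeError("unreachable")


async def crawl(fetch: Fetcher, urls: list[str], concurrency: int = 2) -> dict[str, str]:
    """Fetch all URLs with at most `concurrency` requests in flight at once."""
    semaphore = asyncio.Semaphore(concurrency)

    async def worker(url: str) -> tuple[str, str]:
        async with semaphore:
            return url, await fetch_with_retries(fetch, url)

    return dict(await asyncio.gather(*(worker(u) for u in urls)))


def make_flaky_fetcher(failures_per_url: int) -> Fetcher:
    """Hypothetical stand-in for an HTTP client: fails N times per URL, then succeeds."""
    attempts: dict[str, int] = {}

    async def fetch(url: str) -> str:
        attempts[url] = attempts.get(url, 0) + 1
        if attempts[url] <= failures_per_url:
            raise ConnectionError(f"simulated failure for {url}")
        return f"<html>{url}</html>"

    return fetch


if __name__ == "__main__":
    urls = ["https://example.com/a", "https://example.com/b"]
    pages = asyncio.run(crawl(make_flaky_fetcher(2), urls))
    print(sorted(pages))
```

In this sketch the semaphore caps the number of simultaneous requests, and each request retries independently, so one flaky URL does not block the others.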

Quick Start & Requirements

Install with: python -m pip install 'crawlee[all]'
Install Playwright dependencies: playwright install
Verify installation: python -c 'import crawlee; print(crawlee.__version__)'
Full documentation: https://crawlee.dev/docs/intro
Examples: https://crawlee.dev/examples

Highlighted Details

Unified interface for HTTP and headless browser crawling.
Asyncio-based for high performance and compatibility with other asynchronous Python code.
Automatic retries, proxy rotation, and session management.
Configurable request routing and pluggable storage.
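Two of the features listed above, request routing and proxy rotation, can be sketched with the standard library alone. The `Router` and `ProxyRotator` classes below are hypothetical illustrations of the concepts, not Crawlee's own classes, and the proxy URLs are placeholders. Routing is assumed to be label-based with a fallback default handler, and rotation is assumed to be simple round-robin.

```python
import itertools
from typing import Callable, Iterator, Optional

# A handler takes a URL and returns some extracted result.
Handler = Callable[[str], str]


class Router:
    """Dispatch a request to a handler chosen by its label (illustrative only)."""

    def __init__(self) -> None:
        self._handlers: dict[str, Handler] = {}
        self._default: Optional[Handler] = None

    def default_handler(self, func: Handler) -> Handler:
        """Register the fallback handler used when no label matches."""
        self._default = func
        return func

    def handler(self, label: str) -> Callable[[Handler], Handler]:
        """Return a decorator that registers a handler for `label`."""
        def register(func: Handler) -> Handler:
            self._handlers[label] = func
            return func
        return register

    def dispatch(self, url: str, label: Optional[str] = None) -> str:
        func = self._handlers.get(label) if label is not None else None
        func = func or self._default
        if func is None:
            raise LookupError(f"no handler for label {label!r}")
        return func(url)


class ProxyRotator:
    """Hand out proxy URLs in round-robin order (illustrative only)."""

    def __init__(self, proxy_urls: list[str]) -> None:
        if not proxy_urls:
            raise ValueError("at least one proxy URL is required")
        self._cycle: Iterator[str] = itertools.cycle(proxy_urls)

    def next_proxy(self) -> str:
        return next(self._cycle)


if __name__ == "__main__":
    router = Router()

    @router.default_handler
    def handle_listing(url: str) -> str:
        return f"listing:{url}"

    @router.handler("DETAIL")
    def handle_detail(url: str) -> str:
        return f"detail:{url}"

    # Placeholder proxy addresses, not real endpoints.
    rotator = ProxyRotator(["http://proxy-1.example:8000", "http://proxy-2.example:8000"])
    for url, label in [("https://example.com/", None), ("https://example.com/item/1", "DETAIL")]:
        print(rotator.next_proxy(), router.dispatch(url, label))
```

Keeping routing separate from proxy selection means either piece can be swapped out, for example replacing round-robin with a health-aware strategy, without touching the handlers.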

Maintenance & Community

Developed by Apify.
Support channels: GitHub Issues, Stack Overflow, GitHub Discussions, Discord server.
Contribution guidelines available in CONTRIBUTING.md.

Licensing & Compatibility

Licensed under Apache License 2.0.
Permissive license suitable for commercial use and integration with closed-source projects.

Limitations & Caveats

The project describes itself as open to early adopters, so it is still under active development and its API may change between releases. It aims to get past bot protections, but its effectiveness against sophisticated anti-bot measures may vary.

Explore Similar Projects

oxylabs-ai-studio-py by oxylabs

dendrite-python-sdk by dendrite-systems

gpt4V-scraper by vdutts7

gpt-automated-web-scraper by djb-gt

mcp by hyperbrowserai

AI-Web-Scraper by techwithtim

trafilatura by adbar

mcp-chrome by hangwin

steel-browser by steel-dev

crawlee by apify

Scrapegraph-ai by ScrapeGraphAI

suna by kortix-ai