crawlee-python by apify

Python library for web scraping and browser automation

created 1 year ago
6,104 stars

Top 8.6% on sourcepulse

Project Summary

Crawlee for Python is a library for building reliable web scrapers and automating browser interactions. It targets developers who need to extract data for AI, LLM, or RAG applications, and offers a unified interface for both raw HTTP requests and headless browser automation, with built-in proxy rotation and error handling.

How It Works

Crawlee provides two primary crawler types: BeautifulSoupCrawler, which fetches pages over plain HTTP and parses the returned HTML, and PlaywrightCrawler, which drives a headless browser for JavaScript-heavy sites. This lets users take the lighter HTTP path where static HTML is enough and pay the cost of a browser only where rendering is required. Its asyncio-based architecture and configuration options give fine-grained control over crawling behavior, retries, and data storage.
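
To make the asyncio-plus-retries idea concrete, here is a minimal standard-library sketch. It is not Crawlee's API: `Fetcher`, `fetch_with_retries`, `crawl`, and `make_flaky_fetcher` are hypothetical names chosen for illustration, and the fetcher is a stand-in that never touches the network. It only shows how a pluggable fetch function (HTTP or browser based) can sit behind one crawl loop with bounded concurrency and automatic retries.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List, Optional, Tuple

# Hypothetical type: any coroutine function that turns a URL into page text.
# An HTTP-based or a browser-based fetcher could both satisfy it.
Fetcher = Callable[[str], Awaitable[str]]


async def fetch_with_retries(fetch: Fetcher, url: str, max_retries: int = 3) -> str:
    """Call `fetch` until it succeeds or the retry budget is exhausted."""
    last_error: Optional[Exception] = None
    for _ in range(max_retries):
        try:
            return await fetch(url)
        except ConnectionError as exc:
            last_error = exc
            # Yield to the event loop; a real crawler would back off here.
            await asyncio.sleep(0)
    raise RuntimeError(f"{url} failed after {max_retries} attempts") from last_error


async def crawl(fetch: Fetcher, urls: List[str], concurrency: int = 2) -> Dict[str, str]:
    """Fetch all URLs concurrently, with at most `concurrency` in flight."""
    semaphore = asyncio.Semaphore(concurrency)

    async def worker(url: str) -> Tuple[str, str]:
        async with semaphore:
            return url, await fetch_with_retries(fetch, url)

    pairs = await asyncio.gather(*(worker(u) for u in urls))
    return dict(pairs)


def make_flaky_fetcher(failures_per_url: int) -> Fetcher:
    """Stand-in fetcher (no network) that fails a fixed number of times per URL."""
    attempts: Dict[str, int] = {}

    async def fetch(url: str) -> str:
        attempts[url] = attempts.get(url, 0) + 1
        if attempts[url] <= failures_per_url:
            raise ConnectionError(f"simulated failure for {url}")
        return f"<html>{url}</html>"

    return fetch


if __name__ == "__main__":
    pages = asyncio.run(
        crawl(make_flaky_fetcher(2), ["https://a.example", "https://b.example"])
    )
    print(sorted(pages))
```

Because the crawl loop depends only on the `Fetcher` signature, swapping the HTTP path for a browser path would not change the retry or concurrency logic, which is the kind of separation the unified interface described above suggests.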

Highlighted Details

  • Unified interface for HTTP and headless browser crawling.
  • Asyncio-based for high performance and compatibility.
  • Automatic retries, proxy rotation, and session management.
  • Configurable request routing and pluggable storage.
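
Two of the features listed above, proxy rotation and request routing, can be sketched in a few lines of standard-library Python. This is an illustrative sketch only, not Crawlee's implementation: `ProxyRotator`, `Router`, the labels, and the example proxy URLs are all hypothetical. Round-robin is assumed as the rotation strategy, and routing is assumed to dispatch on a string label with a fallback to a default handler.

```python
import itertools
from typing import Callable, Dict, Iterator, List

Handler = Callable[[str], str]


class ProxyRotator:
    """Hands out proxy URLs in round-robin order (hypothetical helper)."""

    def __init__(self, proxy_urls: List[str]) -> None:
        if not proxy_urls:
            raise ValueError("at least one proxy URL is required")
        self._cycle: Iterator[str] = itertools.cycle(proxy_urls)

    def next_proxy(self) -> str:
        return next(self._cycle)


class Router:
    """Dispatches a URL to the handler registered for its label."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Handler] = {}

    def handler(self, label: str) -> Callable[[Handler], Handler]:
        # Decorator factory: registers `func` under `label` and returns it unchanged.
        def register(func: Handler) -> Handler:
            self._handlers[label] = func
            return func

        return register

    def dispatch(self, label: str, url: str) -> str:
        # Unknown labels fall back to the "default" handler, if one exists.
        func = self._handlers.get(label) or self._handlers.get("default")
        if func is None:
            raise KeyError(f"no handler for label {label!r}")
        return func(url)


router = Router()


@router.handler("default")
def handle_default(url: str) -> str:
    return f"default:{url}"


@router.handler("detail")
def handle_detail(url: str) -> str:
    return f"detail:{url}"


if __name__ == "__main__":
    rotator = ProxyRotator(["http://p1.example:8000", "http://p2.example:8000"])
    for label, url in [("detail", "https://example.com/item/1"), ("list", "https://example.com/")]:
        print(rotator.next_proxy(), router.dispatch(label, url))
```

Keeping rotation and routing as small independent objects means either can be replaced (for example, by a rotation policy that skips failing proxies) without touching the handlers.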

Maintenance & Community

  • Developed by Apify.
  • Support channels: GitHub Issues, Stack Overflow, GitHub Discussions, Discord server.
  • Contribution guidelines available in CONTRIBUTING.md.

Licensing & Compatibility

  • Licensed under Apache License 2.0.
  • Permissive license suitable for commercial use and integration with closed-source projects.

Limitations & Caveats

The library is described as open to early adopters, so the API may still change as development continues. While it aims to bypass bot protections, its effectiveness may vary against sophisticated anti-bot measures.

Health Check

  • Last commit: 1 day ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 43
  • Issues (30d): 25

Star History

  • 547 stars in the last 90 days
