crawlee-python by apify

Python library for web scraping and browser automation

Created 1 year ago
6,313 stars

Top 8.2% on SourcePulse

Project Summary

Crawlee for Python is a comprehensive library for building reliable web scrapers and automating browser interactions. It targets developers needing to extract data for AI, LLMs, or RAG applications, offering a unified interface for both raw HTTP requests and headless browser automation, with built-in proxy rotation and robust error handling.

How It Works

Crawlee provides two primary crawler types: BeautifulSoupCrawler, which fetches pages over plain HTTP and parses the HTML, and PlaywrightCrawler, which drives a headless browser for JavaScript-heavy sites. This lets users pick the lighter HTTP crawler where static HTML is enough and reserve the browser for pages that need rendering. Its asyncio-based architecture and configuration options give fine-grained control over crawling behavior, retries, and data storage.

Quick Start & Requirements

Highlighted Details

  • Unified interface for HTTP and headless browser crawling.
  • Asyncio-based, for high performance and compatibility with other asynchronous Python libraries.
  • Automatic retries, proxy rotation, and session management.
  • Configurable request routing and pluggable storage.
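
To make the "automatic retries, proxy rotation" item concrete, here is a minimal sketch of the general pattern using only the Python standard library. It is not Crawlee's API or implementation: the names fetch_with_retries, flaky_fetch, and PROXIES are hypothetical, the proxy URLs are placeholders, and flaky_fetch stands in for a real HTTP call so the example runs offline.

```python
from __future__ import annotations

import asyncio
import itertools
from collections.abc import Awaitable, Callable

# Hypothetical proxy URLs, used only for illustration.
PROXIES = ["http://proxy-a.example:8000", "http://proxy-b.example:8000"]


async def fetch_with_retries(
    url: str,
    fetch: Callable[[str, str], Awaitable[str]],
    proxies: list[str],
    max_retries: int = 3,
    backoff_seconds: float = 0.1,
) -> str:
    """Call fetch(url, proxy) up to max_retries times, rotating proxies."""
    rotation = itertools.cycle(proxies)
    last_error: Exception | None = None
    for attempt in range(1, max_retries + 1):
        proxy = next(rotation)  # each attempt uses the next proxy in the cycle
        try:
            return await fetch(url, proxy)
        except ConnectionError as error:
            last_error = error
            if attempt < max_retries:
                # Wait a little longer after each failed attempt.
                await asyncio.sleep(backoff_seconds * attempt)
    raise RuntimeError(f"{url} failed after {max_retries} attempts") from last_error


async def flaky_fetch(url: str, proxy: str) -> str:
    """Stand-in for a real HTTP request: the first proxy always fails."""
    if proxy == PROXIES[0]:
        raise ConnectionError(f"{proxy} refused the connection")
    return f"<html>{url} via {proxy}</html>"


async def main() -> str:
    return await fetch_with_retries("https://example.com", flaky_fetch, PROXIES)


if __name__ == "__main__":
    print(asyncio.run(main()))
```

In this sketch the first attempt fails on the first proxy, and the retry succeeds through the second one. A crawling library would typically layer session handling and per-request error policies on top of a loop like this.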

Maintenance & Community

  • Developed by Apify.
  • Support channels: GitHub Issues, Stack Overflow, GitHub Discussions, Discord server.
  • Contribution guidelines available in CONTRIBUTING.md.

Licensing & Compatibility

  • Licensed under Apache License 2.0.
  • Permissive license suitable for commercial use and integration with closed-source projects.

Limitations & Caveats

The project describes itself as open to early adopters, so further development and API changes should be expected. It aims to get past bot protections, but effectiveness will vary against sophisticated anti-bot measures.

Health Check

  • Last commit: 1 day ago
  • Responsiveness: 1 day
  • Pull requests (30d): 38
  • Issues (30d): 16
  • Star history: 148 stars in the last 30 days
