Crawling infrastructure for scalable web scraping
This project provides a distributed, serverless crawling infrastructure designed for large-scale web scraping, particularly for sites employing anti-bot measures. It targets developers needing to extract data from JavaScript-heavy websites or those with sophisticated bot detection, offering a flexible and cost-effective solution by leveraging cloud resources.
How It Works
The infrastructure uses a master-scheduler component to manage crawl tasks and allocate resources. Crawling endpoints can be deployed on various backends, including AWS Lambda, Azure Functions, or Docker Swarm/Kubernetes clusters, with a preference for cost-effective AWS Spot Instances. Pages are rendered in a customized headless Chrome browser controlled via Puppeteer, which evades bot detection by masking the browser's automation fingerprint and mimicking human input patterns.
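The sketch below illustrates the kind of Puppeteer-level evasion described here. It is a minimal example, not the project's actual implementation: the launch flags, user agent string, and mouse path are illustrative assumptions.

```typescript
import puppeteer from 'puppeteer';

async function openStealthPage(url: string): Promise<void> {
  const browser = await puppeteer.launch({
    headless: true,
    // Flag chosen for illustration; real deployments tune this set.
    args: ['--no-sandbox', '--disable-blink-features=AutomationControlled'],
  });
  const page = await browser.newPage();

  // Hide the most common headless giveaway before any site script runs.
  await page.evaluateOnNewDocument(() => {
    Object.defineProperty(navigator, 'webdriver', { get: () => undefined });
  });

  // Use a realistic desktop user agent instead of the headless default.
  await page.setUserAgent(
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 ' +
      '(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
  );

  await page.goto(url, { waitUntil: 'networkidle2' });

  // Mimic human input: move the mouse through intermediate points
  // instead of jumping directly to a target.
  await page.mouse.move(120, 200, { steps: 25 });

  await browser.close();
}

openStealthPage('https://example.com').catch(console.error);
```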
Quick Start & Requirements
Requires Node.js with npm and the TypeScript compiler (`tsc`). Install dependencies and compile the shared library and the master:

```bash
cd master/
npm install
cd ../lib/
tsc
cd ../master/
tsc
```

- Deploy crawling endpoints with `crawler/deploy_all.js`, configured through `crawler.env`
- API documentation (Swagger UI): http://localhost:9001/swagger/
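Once the master is running, crawl tasks are submitted through its HTTP API. The following TypeScript sketch is hypothetical: the `/task` route and the payload field names are assumptions for illustration; the authoritative routes and schemas are listed in the Swagger UI above.

```typescript
// Hypothetical payload shape: a list of URLs plus the crawler
// function to run on each item.
interface CrawlTask {
  items: string[];
  function: string;
}

async function submitTask(task: CrawlTask): Promise<unknown> {
  // Assumed endpoint; check the Swagger UI for the real route.
  const res = await fetch('http://localhost:9001/task', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(task),
  });
  if (!res.ok) throw new Error(`master returned ${res.status}`);
  return res.json();
}

submitTask({ items: ['https://example.com'], function: 'render.js' })
  .then(console.log)
  .catch(console.error);
```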
Maintenance & Community
The project is open source and welcomes community contributions. A SaaS offering, Scrapeulous.com, is available for users who prefer a managed solution.
Licensing & Compatibility
No specific license is stated in the README, so compatibility with commercial use or closed-source linking would require clarification of the license terms from the maintainer.
Limitations & Caveats
The setup process is complex, requiring significant cloud infrastructure knowledge and configuration. The README mentions a "cat and mouse game" with anti-bot companies, implying ongoing maintenance may be needed to adapt to new detection techniques. The serverless architecture (e.g., AWS Lambda) has a 5-minute execution limit per function.
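One common way to work within such a per-invocation time box is to split a large crawl into batches that each finish inside the limit. A minimal sketch follows; the batch size is an arbitrary assumption, not a project constant.

```typescript
// Split a large URL list into batches small enough for one
// time-boxed invocation; 25 is an illustrative guess.
const BATCH_SIZE = 25;

function toBatches<T>(items: T[], size: number = BATCH_SIZE): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Each batch would then be dispatched as a separate serverless invocation.
const batches = toBatches(['https://example.com/a', 'https://example.com/b']);
console.log(`dispatching ${batches.length} invocation(s)`);
```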