Crawling infrastructure for scalable web scraping
This project provides a distributed, serverless crawling infrastructure designed for large-scale web scraping, particularly for sites employing anti-bot measures. It targets developers needing to extract data from JavaScript-heavy websites or those with sophisticated bot detection, offering a flexible and cost-effective solution by leveraging cloud resources.
How It Works
The infrastructure uses a master-scheduler component to manage crawl tasks and allocate resources. Crawling endpoints can be deployed on various backends, including AWS Lambda, Azure Functions, or Docker Swarm/Kubernetes clusters, with a preference for cost-effective AWS Spot Instances. Pages are rendered in a customized headless Chrome browser controlled via Puppeteer, which evades bot detection by masking the browser's automation fingerprint and mimicking human input patterns.
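The sketch below illustrates the kind of Puppeteer-level evasion described here. It is a minimal example, not the project's actual implementation: the launch flags, user agent string, and mouse path are illustrative assumptions.

```typescript
import puppeteer from 'puppeteer';

async function openStealthPage(url: string): Promise<void> {
  const browser = await puppeteer.launch({
    headless: true,
    // Flag chosen for illustration; real deployments tune this set.
    args: ['--no-sandbox', '--disable-blink-features=AutomationControlled'],
  });
  const page = await browser.newPage();

  // Hide the most common headless giveaway before any site script runs.
  await page.evaluateOnNewDocument(() => {
    Object.defineProperty(navigator, 'webdriver', { get: () => undefined });
  });

  // Use a realistic desktop user agent instead of the headless default.
  await page.setUserAgent(
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 ' +
      '(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
  );

  await page.goto(url, { waitUntil: 'networkidle2' });

  // Mimic human input: move the mouse through intermediate points
  // instead of jumping directly to a target.
  await page.mouse.move(120, 200, { steps: 25 });

  await browser.close();
}

openStealthPage('https://example.com').catch(console.error);
```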
Quick Start & Requirements
Requires Node.js with npm and the TypeScript compiler (`tsc`). Install dependencies and compile the shared library and the master:

```bash
cd master/
npm install
cd ../lib/
tsc
cd ../master/
tsc
```

- Deploy crawling endpoints with `crawler/deploy_all.js`, configured through `crawler.env`
- API documentation (Swagger UI): http://localhost:9001/swagger/
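Once the master is running, crawl tasks are submitted through its HTTP API. The following TypeScript sketch is hypothetical: the `/task` route and the payload field names are assumptions for illustration; the authoritative routes and schemas are listed in the Swagger UI above.

```typescript
// Hypothetical payload shape: a list of URLs plus the crawler
// function to run on each item.
interface CrawlTask {
  items: string[];
  function: string;
}

async function submitTask(task: CrawlTask): Promise<unknown> {
  // Assumed endpoint; check the Swagger UI for the real route.
  const res = await fetch('http://localhost:9001/task', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(task),
  });
  if (!res.ok) throw new Error(`master returned ${res.status}`);
  return res.json();
}

submitTask({ items: ['https://example.com'], function: 'render.js' })
  .then(console.log)
  .catch(console.error);
```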
Maintenance & Community
The project is open source and welcomes community contributions. A SaaS offering, Scrapeulous.com, is available for users who prefer a managed solution.
Licensing & Compatibility
No specific license is stated in the README, so compatibility with commercial use or closed-source linking would require clarification of the license terms from the maintainer.
Limitations & Caveats
The setup process is complex, requiring significant cloud infrastructure knowledge and configuration. The README mentions a "cat and mouse game" with anti-bot companies, implying ongoing maintenance may be needed to adapt to new detection techniques. The serverless architecture (e.g., AWS Lambda) has a 5-minute execution limit per function.
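One common way to work within such a per-invocation time box is to split a large crawl into batches that each finish inside the limit. A minimal sketch follows; the batch size is an arbitrary assumption, not a project constant.

```typescript
// Split a large URL list into batches small enough for one
// time-boxed invocation; 25 is an illustrative guess.
const BATCH_SIZE = 25;

function toBatches<T>(items: T[], size: number = BATCH_SIZE): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Each batch would then be dispatched as a separate serverless invocation.
const batches = toBatches(['https://example.com/a', 'https://example.com/b']);
console.log(`dispatching ${batches.length} invocation(s)`);
```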