Crawling-Infrastructure by NikolaiT

Crawling infrastructure for scalable web scraping

created 5 years ago
430 stars

Top 70.1% on sourcepulse

Project Summary

This project provides a distributed, serverless crawling infrastructure designed for large-scale web scraping, particularly for sites employing anti-bot measures. It targets developers needing to extract data from JavaScript-heavy websites or those with sophisticated bot detection, offering a flexible and cost-effective solution by leveraging cloud resources.

How It Works

The infrastructure uses a master-scheduler component to manage crawl tasks and allocate resources. Crawling endpoints can be deployed on several backends, including AWS Lambda, Azure Functions, or Docker Swarm/Kubernetes clusters, with a preference for cost-effective AWS Spot Instances. Pages are loaded in a customized headless Chrome browser controlled via Puppeteer, which applies techniques to evade bot detection, such as presenting a realistic browser fingerprint and producing human-like input patterns.
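The scheduling idea can be illustrated with a minimal sketch. It is not the project's actual scheduler: the CrawlTask and WorkerAssignment types, the scheduleTasks function, and the batchSize and maxWorkers parameters are all hypothetical names chosen for this example. It only shows how a master process might split a queue of pending tasks into per-worker batches and carry the overflow into the next round.

```typescript
// Hypothetical shapes: the project's real task and worker types are not
// described in this summary.
interface CrawlTask {
  id: number;
  url: string;
}

interface WorkerAssignment {
  workerId: number;
  tasks: CrawlTask[];
}

// Split pending tasks into batches of at most `batchSize`, one batch per
// worker, launching no more than `maxWorkers` workers. Tasks that do not fit
// stay in `remaining` for the next scheduling round.
function scheduleTasks(
  pending: CrawlTask[],
  batchSize: number,
  maxWorkers: number
): { assignments: WorkerAssignment[]; remaining: CrawlTask[] } {
  if (batchSize < 1 || maxWorkers < 0) {
    throw new Error("batchSize must be >= 1 and maxWorkers must be >= 0");
  }
  const assignments: WorkerAssignment[] = [];
  let offset = 0;
  while (offset < pending.length && assignments.length < maxWorkers) {
    assignments.push({
      workerId: assignments.length,
      tasks: pending.slice(offset, offset + batchSize),
    });
    offset += batchSize;
  }
  return { assignments, remaining: pending.slice(offset) };
}

// Usage: seven tasks, batches of three, at most two workers.
const exampleTasks: CrawlTask[] = [1, 2, 3, 4, 5, 6, 7].map((id) => ({
  id,
  url: `https://example.com/page/${id}`,
}));
const plan = scheduleTasks(exampleTasks, 3, 2);
console.log(plan.assignments.length, plan.remaining.length);
```

In a real deployment each assignment would presumably be sent to a crawling endpoint (a Lambda invocation or a container), and the remaining tasks would wait until a worker slot frees up.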

Quick Start & Requirements

  • Installation: Requires Node.js, npm, yarn, and TypeScript. Local compilation runs these commands in order: cd master/; npm install; cd ../lib/; tsc; cd ../master/; tsc.
  • Prerequisites: AWS account with programmatic access (Lambda, S3, CloudWatch permissions), Docker, Serverless framework.
  • Setup: Detailed AWS setup involves EC2 instance creation, IAM user configuration, Docker installation, Node.js/Yarn/TypeScript setup, and deployment scripts. AWS Lambda deployment requires configuring crawler/deploy_all.js and crawler.env.
  • Resources: A detailed tutorial outlines setting up an EC2 instance (t2.medium recommended), configuring security groups, and managing AWS credentials.
  • Documentation: Swagger API documentation is available at http://localhost:9001/swagger/.

Highlighted Details

  • Supports both basic HTTP crawling and advanced headless Chrome crawling with Puppeteer.
  • Designed to evade sophisticated anti-bot detection mechanisms.
  • Flexible deployment options: AWS Lambda, Azure Functions, Docker Swarm, Kubernetes.
  • Leverages cost-effective cloud infrastructure like AWS Spot Instances.
  • Integrates with external captcha solving services.
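As a generic illustration of what "human-like input patterns" can mean in practice, the sketch below generates a randomized delay for each keystroke instead of typing at a fixed rate. This is an assumption about one possible ingredient, not the project's documented technique; the keystrokeDelays function and its parameters are hypothetical.

```typescript
// Return one delay in milliseconds per character of `text`, each drawn
// uniformly from [minMs, maxMs]. `random` is injectable so the output can be
// made reproducible; it defaults to Math.random.
function keystrokeDelays(
  text: string,
  minMs: number,
  maxMs: number,
  random: () => number = Math.random
): number[] {
  if (minMs < 0 || maxMs < minMs) {
    throw new Error("expected 0 <= minMs <= maxMs");
  }
  return Array.from(text, () =>
    Math.round(minMs + random() * (maxMs - minMs))
  );
}

// Usage: delays for typing a search query, between 50 ms and 150 ms each.
const delays = keystrokeDelays("web scraping", 50, 150);
console.log(delays.length);
```

A browser automation script could wait for each delay before sending the next key press. Real detection systems look at many more signals than typing rhythm, so this alone should not be expected to defeat them.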

Maintenance & Community

The project is open-source, seeking community contributions. A SaaS service, Scrapeulous.com, is offered for users preferring a managed solution.

Licensing & Compatibility

The project is open-source, but the README does not state a specific license. Commercial use or closed-source linking would require clarifying the license terms first.

Limitations & Caveats

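The setup process is complex and requires significant knowledge of cloud infrastructure and its configuration. The README mentions a "cat and mouse game" with anti-bot companies, implying that ongoing maintenance may be needed to adapt to new detection techniques. Serverless backends also cap execution time per invocation: this summary cites a 5-minute limit for AWS Lambda, although AWS currently allows a configurable timeout of up to 15 minutes. Either way, long crawls must be split into tasks that finish within the limit.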
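One common way to live with a per-invocation time limit is to stop starting new work shortly before the deadline and hand the leftover items back for a later invocation. The sketch below shows that pattern under stated assumptions: crawlWithinBudget, budgetMs, safetyMarginMs, and the injectable now clock are hypothetical names, and nothing here is taken from the project's code.

```typescript
// Process items one at a time until the time budget is nearly used up, then
// return the unprocessed rest so the caller can re-queue it.
// `safetyMarginMs` should be at least as long as the slowest expected item,
// because an item that has started is always allowed to finish.
async function crawlWithinBudget<T, R>(
  items: T[],
  handler: (item: T) => Promise<R>,
  budgetMs: number,
  safetyMarginMs: number,
  now: () => number = Date.now
): Promise<{ results: R[]; unprocessed: T[] }> {
  const deadline = now() + budgetMs - safetyMarginMs;
  const results: R[] = [];
  for (const item of items) {
    if (now() >= deadline) break; // not enough time left to start another item
    results.push(await handler(item));
  }
  // Items are handled in order, so results.length is the number processed.
  return { results, unprocessed: items.slice(results.length) };
}

// Usage with a simulated clock: each "page" takes 400 ms, the budget is
// 1000 ms, and the safety margin is 500 ms.
let simulatedTime = 0;
const simulatedClock = () => simulatedTime;
const simulatedCrawl = async (url: string): Promise<number> => {
  simulatedTime += 400; // pretend the request took 400 ms
  return url.length;
};

crawlWithinBudget(
  ["https://example.com/a", "https://example.com/b", "https://example.com/c"],
  simulatedCrawl,
  1000,
  500,
  simulatedClock
).then((out) => console.log(out.results.length, out.unprocessed.length));
```

With these numbers the loop starts the first two items (at 0 ms and 400 ms), then stops at 800 ms because the 500 ms cutoff has passed, leaving the third item for the next invocation.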

Health Check

  • Last commit: 2 years ago
  • Responsiveness: 1 week
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 2 stars in the last 90 days
