spider by spider-rs

High-performance web crawler and scraper

Created 8 years ago
2,264 stars

Top 19.6% on SourcePulse

View on GitHub
Project Summary

spider-rs/spider is a high-performance web crawling and scraping framework built in Rust, designed for data-curation workloads at scale. It targets developers and researchers who need to extract data from the web efficiently, handling JavaScript-rendered content, anti-bot measures, and complex automation tasks. The primary benefit is a fast, flexible, production-ready crawling engine with advanced AI capabilities.

How It Works

The core architecture emphasizes concurrent crawling with streaming responses for real-time data processing. Spider offers flexible rendering options: standard HTTP requests, Chrome DevTools Protocol (CDP) for JavaScript-heavy sites with stealth capabilities, and WebDriver for integration with Selenium Grid or remote browsers. It incorporates built-in data processing utilities for HTML transformations and CSS/XPath scraping, alongside an AI agent (spider_agent) for sophisticated web automation and research synthesis across multiple LLM and search providers.
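As a sketch of the streaming model, the crate's subscription API lets a background task receive pages in real time while the crawl runs. This is adapted from the crate's documented examples; the target URL is a placeholder, and the exact feature flags and signatures should be verified against the current docs.

```rust
use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    // Build a crawler for a target site (placeholder URL).
    let mut website: Website = Website::new("https://example.com");

    // Subscribe with a channel capacity of 16; crawled pages
    // stream into the receiver as they complete.
    let mut rx = website.subscribe(16).unwrap();

    tokio::spawn(async move {
        while let Ok(page) = rx.recv().await {
            // Process each page in real time, e.g. parse or persist it.
            println!("received: {}", page.get_url());
        }
    });

    // Run the concurrent crawl; the subscriber above drains results.
    website.crawl().await;
}
```

The channel-based design is what allows processing to overlap with crawling instead of waiting for the full link set.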

Quick Start & Requirements

For production, Spider Cloud offers a pay-per-use service ($1/GB data transfer) with no infrastructure management. For local development, integrate the spider crate into Rust projects via Cargo.toml (spider = "2"). Alternative interfaces include spider_cli for command-line usage, and spider-nodejs / spider-py for Node.js and Python projects, respectively. Advanced rendering requires enabling features such as chrome (which assumes a Chrome browser installation) or webdriver (which requires a WebDriver-compatible service such as Selenium). Links to guides, API docs, and community chat are mentioned but not provided.
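A minimal manifest for local development might look like the following; the version and feature names come from the summary above, but the full feature list should be checked against the crate's documentation.

```toml
[dependencies]
# Base crate. Add features = ["chrome"] for CDP rendering
# or features = ["webdriver"] for Selenium-style remotes.
spider = "2"
```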

Highlighted Details

  • Concurrent and streaming crawls with real-time page subscriptions.
  • Decentralized crawling support for horizontal scaling.
  • Browser automation via CDP (with stealth/interception) and WebDriver.
  • spider_agent: A concurrent-safe multimodal AI agent for web automation, supporting multiple LLM and search providers.
  • Built-in web challenge solving (deterministic and AI-assisted).
  • Integrated caching (memory, disk, hybrid), proxy rotation, and cron job scheduling.

Maintenance & Community

Community interaction is facilitated via a chat channel, and contribution guidelines are available. Specific details on core maintainers, sponsorships, or project roadmap are not detailed in the provided text.

Licensing & Compatibility

The project is released under the permissive MIT license, allowing commercial use and integration into closed-source applications, provided the copyright and license notice are retained.

Limitations & Caveats

While powerful, Spider's browser automation (CDP/WebDriver) requires external dependencies: browsers, drivers, or services. The AI agent features may incur costs associated with LLM and search API usage. No explicit alpha status or known bugs are mentioned.

Health Check

Last Commit: 1 day ago
Responsiveness: Inactive
Pull Requests (30d): 14
Issues (30d): 3
Star History: 69 stars in the last 30 days
