deepscrape  by stretchcloud

Intelligent web scraping and LLM-powered data extraction

Created 5 months ago
254 stars

Top 99.1% on SourcePulse

Project Summary

DeepScrape is an AI-powered tool for web scraping and data extraction that turns websites into structured formats such as JSON and Markdown. It targets developers building RAG pipelines, data workflows, and modern web applications. Its LLM integration is flexible: extraction can run against a cloud provider or against a local model, so that scraped data stays on-premises.

How It Works

The system utilizes Playwright for robust browser automation, including stealth capabilities to mimic human users. It integrates with Large Language Models (LLMs), supporting both cloud providers like OpenAI and local deployments via Ollama, vLLM, or LocalAI, ensuring data privacy. Core functionality involves scraping web content, then using LLMs to extract specific information based on predefined JSON schemas or general summarization tasks. This approach allows for precise, context-aware data extraction from dynamic web pages.
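The schema-driven extraction step described above can be illustrated with a minimal sketch. It is not DeepScrape's implementation: the simplified schema (field name mapped to a Python type), the helper names build_prompt and parse_and_validate, and the canned fake_llm reply are all assumptions made for illustration. The idea shown is only the general pattern of asking a model for named fields and then validating the reply before trusting it.

```python
import json

# Hypothetical, simplified schema: field name -> expected Python type.
# DeepScrape accepts JSON schemas; this flat mapping is an assumption.
ARTICLE_SCHEMA = {"title": str, "author": str, "word_count": int}


def build_prompt(page_text: str, schema: dict) -> str:
    """Assemble an extraction prompt that names each field and its type."""
    fields = ", ".join(f"{name} ({tp.__name__})" for name, tp in schema.items())
    return (
        f"Extract the following fields as a JSON object: {fields}.\n\n"
        f"Page content:\n{page_text}"
    )


def parse_and_validate(raw_reply: str, schema: dict) -> dict:
    """Parse the model's reply and keep only fields that match the schema."""
    data = json.loads(raw_reply)
    if not isinstance(data, dict):
        raise ValueError("reply is not a JSON object")
    for name, tp in schema.items():
        if name not in data:
            raise ValueError(f"missing field: {name}")
        if not isinstance(data[name], tp):
            raise ValueError(f"field {name} is not of type {tp.__name__}")
    # Drop any extra keys the model added beyond the schema.
    return {name: data[name] for name in schema}


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call (cloud or local); returns canned JSON."""
    return '{"title": "Hello", "author": "Ada", "word_count": 120, "extra": true}'


if __name__ == "__main__":
    prompt = build_prompt("Hello, by Ada. (120 words of body text)", ARTICLE_SCHEMA)
    print(parse_and_validate(fake_llm(prompt), ARTICLE_SCHEMA))
```

Validating the reply matters because a model may omit a field, return the wrong type, or add keys that were never requested; rejecting such replies early keeps malformed records out of downstream pipelines.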

Quick Start & Requirements

Installation involves cloning the repository, changing into its directory, and running npm install. Configuration lives in a .env file that sets the LLM provider (e.g., openai, ollama), any API keys, and, optionally, Redis connection details for the job queue. The server is started with npm run dev. The primary requirements are Node.js and npm. Redis is needed for the background job queue, and Docker is recommended for deployment.
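To make the shape of such a .env configuration concrete, here is a small sketch that parses KEY=VALUE lines and checks them for obvious gaps. The variable names LLM_PROVIDER, OPENAI_API_KEY, and REDIS_HOST are hypothetical placeholders, not names taken from the DeepScrape README, and the rule that the openai provider needs an API key is an assumption about how such a setup would typically behave.

```python
# Hypothetical variable names throughout; check the project's own
# example .env file for the real ones.
SAMPLE_ENV = """
# LLM settings (hypothetical names)
LLM_PROVIDER=ollama
# Optional job-queue settings (hypothetical name)
REDIS_HOST=localhost
"""


def parse_env(text: str) -> dict:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip().strip('"')
    return settings


def check_settings(settings: dict) -> list:
    """Return a list of problems; an empty list means nothing obvious is missing."""
    problems = []
    provider = settings.get("LLM_PROVIDER", "")
    if not provider:
        problems.append("LLM_PROVIDER is not set")
    # Assumption: a cloud provider needs an API key, a local one does not.
    if provider == "openai" and not settings.get("OPENAI_API_KEY"):
        problems.append("OPENAI_API_KEY is required for the openai provider")
    return problems


if __name__ == "__main__":
    print(check_settings(parse_env(SAMPLE_ENV)))
```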

Highlighted Details

  • LLM Extraction: Converts raw web content into structured JSON or Markdown using OpenAI or local LLMs.
  • Batch Processing: Enables efficient processing of multiple URLs concurrently, with options for controlled concurrency, retries, and various download formats (ZIP, consolidated JSON, individual files).
  • API-First Design: Exposes a comprehensive REST API documented with Swagger, facilitating integration into existing systems.
  • Browser Actions: Supports advanced browser interactions like clicking elements, filling forms, scrolling, and waiting for specific conditions.
  • Local LLM Support: Facilitates completely on-premises data processing for enhanced privacy and compliance.
  • Web Crawling: Includes multi-page crawling capabilities with configurable depth, strategies (BFS, DFS), and path filtering, automatically exporting results.
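The crawling options listed above (depth limit, BFS or DFS strategy, path filtering) can be sketched over a small in-memory link graph. This is an illustration of the general technique, not DeepScrape's crawler: the SITE graph, the crawl function, and its parameter names are all hypothetical, and a real crawler would fetch pages and parse links instead of reading them from a dictionary.

```python
from collections import deque

# Hypothetical in-memory site: path -> paths linked from that page.
SITE = {
    "/": ["/docs", "/blog", "/login"],
    "/docs": ["/docs/api", "/docs/guide"],
    "/blog": ["/blog/post-1"],
    "/docs/api": ["/docs/api/v2"],
    "/docs/guide": [],
    "/blog/post-1": [],
    "/docs/api/v2": [],
    "/login": [],
}


def crawl(start, max_depth, strategy="bfs", include_prefix="/"):
    """Visit pages reachable from start, following at most max_depth links."""
    frontier = deque([(start, 0)])
    seen = {start}
    order = []
    while frontier:
        # BFS takes the oldest entry (queue); DFS takes the newest (stack).
        path, depth = frontier.popleft() if strategy == "bfs" else frontier.pop()
        order.append(path)
        if depth == max_depth:
            continue  # depth limit reached: do not follow links from this page
        for link in SITE.get(path, []):
            # Path filter: only follow links under the allowed prefix.
            if link not in seen and link.startswith(include_prefix):
                seen.add(link)
                frontier.append((link, depth + 1))
    return order


if __name__ == "__main__":
    print(crawl("/", max_depth=2, strategy="bfs"))
    print(crawl("/", max_depth=2, strategy="dfs"))
    print(crawl("/", max_depth=2, include_prefix="/docs"))
```

The only difference between the two strategies in this sketch is which end of the frontier is popped. BFS reaches all pages at one depth before going deeper, which tends to suit shallow site surveys, while DFS follows one branch to the depth limit before backtracking.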

Maintenance & Community

The README presents the project as under active development and includes a roadmap of planned features. It does not identify maintainers, community channels (such as Discord or Slack), or sponsorships.

Licensing & Compatibility

DeepScrape is released under the Apache 2.0 license, a permissive license that allows use in commercial and closed-source projects, provided its attribution and notice requirements are met.

Limitations & Caveats

Although the project is feature-rich, several advanced capabilities remain on the roadmap, including browser pooling, automatic schema generation, and a web UI playground. Setup depends on Node.js and npm, so it assumes some familiarity with the JavaScript ecosystem.

Health Check

  • Last Commit: 3 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

3 stars in the last 30 days
