WaterCrawl by watercrawl

Transform web content into LLM-ready data

Created 1 year ago

1,802 stars

Top 23.5% on SourcePulse

Project Summary

WaterCrawl is a self-hosted, open-source web crawling and scraping application designed to transform web content into LLM-ready data. It targets developers and researchers needing to gather and process information from the web at scale, offering advanced crawling, search capabilities, and integrations with AI platforms.

How It Works

WaterCrawl uses a Python stack built on Django, Scrapy, and Celery for asynchronous web crawling and data extraction. Crawls can be customized for depth, speed, and targeting, and a multi-language search engine supports country-specific targeting. Results are processed asynchronously, with real-time progress reported via Server-Sent Events (SSE).
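To make the SSE progress monitoring concrete, here is a minimal sketch of how a client might decode a Server-Sent Events stream. It follows the generic SSE wire format only. The event name "state" and the JSON fields "status" and "pages_done" are hypothetical and chosen for illustration; they are not taken from WaterCrawl's API.

```python
import json
from typing import Dict, Iterable, Iterator


def parse_sse(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one dict per event from an iterable of SSE text lines."""
    event_type = "message"  # default event type in the SSE format
    data_parts = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line ends the current event; dispatch it if it has data.
            if data_parts:
                yield {"event": event_type, "data": "\n".join(data_parts)}
            event_type, data_parts = "message", []
            continue
        if line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # a single leading space is not part of the value
        if field == "event":
            event_type = value
        elif field == "data":
            data_parts.append(value)


# Hypothetical progress stream; the payload shape is an assumption.
SAMPLE_STREAM = [
    ": keep-alive\n",
    "event: state\n",
    'data: {"status": "running", "pages_done": 3}\n',
    "\n",
    "event: state\n",
    'data: {"status": "finished", "pages_done": 10}\n',
    "\n",
]

events = list(parse_sse(SAMPLE_STREAM))
for event in events:
    payload = json.loads(event["data"])
    print(event["event"], payload["status"], payload["pages_done"])
```

In practice the lines would come from a streaming HTTP response rather than a list, but the parsing step stays the same.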

Quick Start & Requirements

Install/Run: Clone the repository, navigate to the docker directory, copy .env.example to .env, and run docker compose up -d.
Prerequisites: Docker.
Configuration: When not deploying on localhost, set MINIO_EXTERNAL_ENDPOINT, MINIO_BROWSER_REDIRECT_URL, and MINIO_SERVER_URL in .env to match the public host.
Links: Quick Start, Deployment Guide, API Overview.
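The configuration step for non-localhost deployments could be scripted along the following lines. This is a sketch under stated assumptions: the helper set_env_values is hypothetical, the sample file contents are a made-up stand-in for the real .env, and the hostname and value formats are placeholders. Check .env.example for the form each variable actually expects.

```python
from typing import Dict


def set_env_values(env_text: str, updates: Dict[str, str]) -> str:
    """Return env_text with each KEY=value in updates replaced or appended."""
    remaining = dict(updates)
    out = []
    for line in env_text.splitlines():
        key = line.split("=", 1)[0].strip()
        is_assignment = "=" in line and not line.lstrip().startswith("#")
        if is_assignment and key in remaining:
            out.append(f"{key}={remaining.pop(key)}")
        else:
            out.append(line)  # comments, blanks, and other keys stay untouched
    # Keys that were not present in the file are appended at the end.
    out.extend(f"{key}={value}" for key, value in remaining.items())
    return "\n".join(out) + "\n"


# Made-up stand-in for a .env file; not the project's real defaults.
SAMPLE_ENV = (
    "# object storage\n"
    "MINIO_EXTERNAL_ENDPOINT=localhost\n"
    "MINIO_SERVER_URL=http://localhost/\n"
)

# Placeholder host and value formats, assumed for illustration only.
updated = set_env_values(
    SAMPLE_ENV,
    {
        "MINIO_EXTERNAL_ENDPOINT": "crawl.example.com",
        "MINIO_BROWSER_REDIRECT_URL": "https://crawl.example.com/minio-console/",
        "MINIO_SERVER_URL": "https://crawl.example.com/",
    },
)
print(updated, end="")
```

Operating on the file's text rather than the file itself keeps the helper easy to test; reading .env and writing the result back is left to the caller.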

Highlighted Details

Advanced web crawling and scraping with customizable options.
Multi-language support with country-specific targeting.
REST API with OpenAPI documentation and client SDKs (Python, Node.js, Go, PHP).
Integrations with Dify, N8N, and other AI/automation platforms.

Maintenance & Community

Recent releases and GitHub Actions test workflows indicate active development.
Contact: support@watercrawl.dev, the address given for security disclosures.

Licensing & Compatibility

License: WaterCrawl License (MIT with additional restrictions).
Compatibility: Suitable for self-hosting; commercial use restrictions may apply due to the custom license.

Limitations & Caveats

The project is self-hosted, so environment variables must be configured carefully, particularly for non-localhost deployments, or file uploads and downloads may not work correctly. Some integrations, such as Langflow, are still in a pull request and have not been merged.

Explore Similar Projects

oxylabs-ai-studio-py by oxylabs

Crawling-Infrastructure by NikolaiT

scraperai by scraperai

doctor by sisig-ai

ii-researcher by Intelligent-Internet

x-crawl by coder-hxl

open-scouts by firecrawl

AI-Web-Scraper by techwithtim

AnyCrawl by any4ai

CyberScraper-2077 by itsOwen

trafilatura by adbar

Scrapegraph-ai by ScrapeGraphAI