Transform web content into LLM-ready data
WaterCrawl is a self-hosted, open-source web crawling and scraping application designed to transform web content into LLM-ready data. It targets developers and researchers needing to gather and process information from the web at scale, offering advanced crawling, search capabilities, and integrations with AI platforms.
How It Works
WaterCrawl utilizes a Python, Django, Scrapy, and Celery stack for asynchronous web crawling and data extraction. It employs customizable crawling options for depth, speed, and targeting, alongside a multi-language search engine with country-specific targeting. Results are processed asynchronously with real-time progress monitoring via Server-Sent Events (SSE).
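As a rough illustration of how a client might consume progress updates delivered over Server-Sent Events, the sketch below parses generic SSE framing (field lines terminated by a blank line). It does not call WaterCrawl's API: the event names (progress, done) and the JSON fields (pages_done, pages_total) are hypothetical, chosen only to show the shape of the technique.

```python
import json
from typing import Iterable, Iterator


def parse_sse(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per SSE event from an iterable of text lines.

    Basic SSE framing: `field: value` lines, with a blank line ending
    each event. Lines starting with ':' are comments (keep-alives).
    """
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # Blank line: dispatch the accumulated event if it carried data.
            if data:
                yield {"event": event, "data": "\n".join(data)}
            event, data = "message", []
        elif line.startswith(":"):
            continue  # comment line, ignored
        else:
            field, _, value = line.partition(":")
            if value.startswith(" "):
                value = value[1:]  # drop the single optional leading space
            if field == "event":
                event = value
            elif field == "data":
                data.append(value)


# Hypothetical stream: event names and JSON fields are illustrative only.
sample = [
    "event: progress\n",
    'data: {"pages_done": 3, "pages_total": 10}\n',
    "\n",
    ": keep-alive\n",
    "event: done\n",
    'data: {"pages_done": 10, "pages_total": 10}\n',
    "\n",
]

events = list(parse_sse(sample))
for ev in events:
    payload = json.loads(ev["data"])
    print(ev["event"], payload["pages_done"], "/", payload["pages_total"])
```

In practice the lines would come from a streaming HTTP response rather than a list; the parser only needs an iterable of text lines, so it stays independent of any particular HTTP client.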
Quick Start & Requirements
In the docker directory, copy .env.example to .env, and run docker compose up -d.
Update .env for MINIO_EXTERNAL_ENDPOINT, MINIO_BROWSER_REDIRECT_URL, and MINIO_SERVER_URL if not deploying on localhost.
Highlighted Details
Maintenance & Community
Contact support@watercrawl.dev for security disclosures.
Licensing & Compatibility
Limitations & Caveats
The project is self-hosted and requires careful configuration of environment variables, particularly for non-localhost deployments, to ensure proper functionality of file uploads and downloads. Some integrations like Langflow are in a pull request and not yet merged.
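One way to catch the non-localhost misconfiguration described above is a small pre-deployment check on the .env contents. The sketch below is not part of WaterCrawl; it assumes plain KEY=VALUE lines and that the three MinIO variables hold either a URL or a bare host[:port] value. The helper names and the sample values are hypothetical.

```python
from urllib.parse import urlparse

MINIO_KEYS = (
    "MINIO_EXTERNAL_ENDPOINT",
    "MINIO_BROWSER_REDIRECT_URL",
    "MINIO_SERVER_URL",
)
LOCAL_HOSTS = {"localhost", "127.0.0.1"}


def parse_env(text: str) -> dict:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip().strip("\"'")
    return env


def host_of(value: str) -> str:
    """Extract the hostname from a URL or a bare host[:port] value."""
    parsed = urlparse(value if "://" in value else "//" + value)
    return parsed.hostname or ""


def localhost_minio_keys(env: dict) -> list:
    """Return MinIO keys that are missing, empty, or still point at localhost."""
    return [
        key for key in MINIO_KEYS
        if host_of(env.get(key, "")) in LOCAL_HOSTS | {""}
    ]


# Hypothetical .env contents for a non-localhost deployment.
sample_env = """
# values below are illustrative only
MINIO_EXTERNAL_ENDPOINT=files.example.com
MINIO_BROWSER_REDIRECT_URL=http://localhost:9001
MINIO_SERVER_URL=https://files.example.com
"""

flagged = localhost_minio_keys(parse_env(sample_env))
print(flagged)
```

Here only MINIO_BROWSER_REDIRECT_URL is flagged, since it is the one value still pointing at localhost.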