watercrawl
Transform web content into LLM-ready data
Top 27.4% on SourcePulse
WaterCrawl is a self-hosted, open-source web crawling and scraping application designed to transform web content into LLM-ready data. It targets developers and researchers needing to gather and process information from the web at scale, offering advanced crawling, search capabilities, and integrations with AI platforms.
How It Works
WaterCrawl uses a Python, Django, Scrapy, and Celery stack for asynchronous web crawling and data extraction. Crawls can be customized by depth, speed, and target, and a multi-language search engine supports country-specific targeting. Results are processed asynchronously, with real-time progress reported via Server-Sent Events (SSE).
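To make the SSE progress monitoring concrete, here is a minimal sketch of how a client might parse an SSE stream. It follows the generic SSE wire format (`event:` and `data:` fields, with a blank line ending each event). The event names `progress` and `done` and the JSON payload fields are hypothetical, chosen for illustration; they are not WaterCrawl's documented schema.

```python
import json
from typing import Iterable, Iterator


def parse_sse(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per SSE event from an iterable of text lines."""
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if data:
                yield {"event": event, "data": "\n".join(data)}
            event, data = "message", []
            continue
        if line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # a single leading space is not part of the value
        if field == "event":
            event = value
        elif field == "data":
            data.append(value)
        # other fields (id, retry) are ignored in this sketch


# Hypothetical stream: event names and payload shape are assumptions.
sample = [
    ": keep-alive",
    "event: progress",
    'data: {"crawled": 3, "total": 10}',
    "",
    "event: done",
    'data: {"crawled": 10, "total": 10}',
    "",
]

events = list(parse_sse(sample))
for e in events:
    payload = json.loads(e["data"])
    print(e["event"], payload["crawled"], "/", payload["total"])
```

In practice the `lines` argument would come from a streaming HTTP response rather than a list; the parser itself does not depend on the transport.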
Quick Start & Requirements
In the docker directory, copy .env.example to .env, and run docker compose up -d.
Update .env for MINIO_EXTERNAL_ENDPOINT, MINIO_BROWSER_REDIRECT_URL, and MINIO_SERVER_URL if not deploying on localhost.
Highlighted Details
Maintenance & Community
Contact support@watercrawl.dev for security disclosures.
Licensing & Compatibility
Limitations & Caveats
The project is self-hosted and requires careful configuration of environment variables, particularly for non-localhost deployments, so that file uploads and downloads work correctly. Some integrations, such as Langflow, are still in a pull request that has not yet been merged.
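As a rough aid for the non-localhost case, the sketch below scans the text of a .env file and flags any of the three MinIO variables that are missing or still point at localhost. The variable names come from the setup notes. The helper names, the sample values, and the assumption that each value is either a URL or a host:port pair are hypothetical.

```python
from urllib.parse import urlparse

MINIO_KEYS = (
    "MINIO_EXTERNAL_ENDPOINT",
    "MINIO_BROWSER_REDIRECT_URL",
    "MINIO_SERVER_URL",
)
LOCAL_HOSTS = {"localhost", "127.0.0.1"}


def parse_env(text: str) -> dict:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


def localhost_minio_keys(env: dict) -> list:
    """Return the MinIO keys that are unset or resolve to a local host."""
    flagged = []
    for key in MINIO_KEYS:
        value = env.get(key, "")
        # urlparse only finds the host after "//", so add it when no scheme is given.
        host = urlparse(value if "://" in value else f"//{value}").hostname
        if not value or host in LOCAL_HOSTS:
            flagged.append(key)
    return flagged


# Hypothetical .env contents, for illustration only.
sample_env = """
# storage settings
MINIO_EXTERNAL_ENDPOINT=localhost:9000
MINIO_BROWSER_REDIRECT_URL=http://localhost:9001/
MINIO_SERVER_URL=https://files.example.com/
"""

flagged = localhost_minio_keys(parse_env(sample_env))
print(flagged)
```

Running a check like this before `docker compose up -d` on a remote host may catch the most common misconfiguration early; it does not validate that the configured endpoints are actually reachable.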