WaterCrawl  by watercrawl

Transform web content into LLM-ready data

Created 9 months ago
1,359 stars

Top 29.6% on SourcePulse

Project Summary

WaterCrawl is a self-hosted, open-source web crawling and scraping application designed to transform web content into LLM-ready data. It targets developers and researchers needing to gather and process information from the web at scale, offering advanced crawling, search capabilities, and integrations with AI platforms.

How It Works

WaterCrawl uses a Python, Django, Scrapy, and Celery stack for asynchronous web crawling and data extraction. Crawls can be customized by depth, speed, and target, and a multi-language search engine supports country-specific targeting. Results are processed asynchronously, with real-time progress reported via Server-Sent Events (SSE).
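
As a rough illustration of how a client might consume SSE progress updates, the sketch below parses the standard SSE wire format (`event:` and `data:` fields, with a blank line ending each event). It does not use WaterCrawl's API: the function name `parse_sse`, the event names, and the JSON fields in the sample are all assumptions made for the example.

```python
import json
from typing import Iterable, Iterator


def parse_sse(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per SSE event from an iterable of text lines."""
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if data:
                yield {"event": event, "data": "\n".join(data)}
            event, data = "message", []
        elif line.startswith(":"):
            continue  # comment / keep-alive line
        else:
            field, _, value = line.partition(":")
            if value.startswith(" "):
                value = value[1:]  # the format strips one leading space
            if field == "event":
                event = value
            elif field == "data":
                data.append(value)


# Hypothetical stream: event names and JSON fields are illustrative only.
sample = [
    ": keep-alive\n",
    "event: progress\n",
    'data: {"pages_done": 3}\n',
    "\n",
    "event: done\n",
    'data: {"pages_done": 5}\n',
    "\n",
]
events = [(e["event"], json.loads(e["data"])) for e in parse_sse(sample)]
```

In practice the lines would come from a streaming HTTP response rather than a list; the parsing logic stays the same.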

Quick Start & Requirements

  • Install/Run: Clone the repository, navigate to the docker directory, copy .env.example to .env, and run docker compose up -d.
  • Prerequisites: Docker.
  • Configuration: Update .env for MINIO_EXTERNAL_ENDPOINT, MINIO_BROWSER_REDIRECT_URL, and MINIO_SERVER_URL if not deploying on localhost.
  • Links: Quick Start, Deployment Guide, API Overview.
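
For non-localhost deployments, the three MinIO variables named above need new values in .env. The sketch below shows one way to rewrite them programmatically. The helper `set_env_values` is hypothetical, and the host `crawl.example.com` and the URL shapes are placeholders, not values taken from the project; the Deployment Guide is the place to check the expected format.

```python
def set_env_values(env_text: str, updates: dict) -> str:
    """Return env_text with KEY=value lines replaced, or appended if absent."""
    remaining = dict(updates)
    out = []
    for line in env_text.splitlines():
        key = line.split("=", 1)[0].strip()
        is_assignment = "=" in line and not line.lstrip().startswith("#")
        if is_assignment and key in remaining:
            out.append(f"{key}={remaining.pop(key)}")
        else:
            out.append(line)  # comments and unrelated keys pass through
    # Keys that were not present in the file are appended at the end.
    out.extend(f"{k}={v}" for k, v in remaining.items())
    return "\n".join(out) + "\n"


# Placeholder values: replace with the real host of the deployment.
updates = {
    "MINIO_EXTERNAL_ENDPOINT": "crawl.example.com",
    "MINIO_BROWSER_REDIRECT_URL": "https://crawl.example.com/minio-console/",
    "MINIO_SERVER_URL": "https://crawl.example.com/",
}

sample_env = (
    "# storage\n"
    "MINIO_EXTERNAL_ENDPOINT=localhost\n"
    "MINIO_SERVER_URL=http://localhost/\n"
    "OTHER=1\n"
)
result = set_env_values(sample_env, updates)
```

To apply this to a real file, read .env into a string, pass it through the helper, and write the result back before running docker compose up -d.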

Highlighted Details

  • Advanced web crawling and scraping with customizable options.
  • Multi-language support with country-specific targeting.
  • REST API with OpenAPI documentation and client SDKs (Python, Node.js, Go, PHP).
  • Integrations with Dify, N8N, and other AI/automation platforms.

Maintenance & Community

  • Active development indicated by recent releases and GitHub Actions for tests.
  • Contact: support@watercrawl.dev (the address given for security disclosures).

Licensing & Compatibility

  • License: WaterCrawl License (MIT with additional restrictions).
  • Compatibility: Suitable for self-hosting; commercial use restrictions may apply due to the custom license.

Limitations & Caveats

Because the project is self-hosted, its environment variables must be configured carefully, particularly for non-localhost deployments, or file uploads and downloads may not work correctly. Some integrations, such as Langflow, are still in a pull request and not yet merged.

Health Check

  • Last commit: 2 weeks ago
  • Responsiveness: Inactive
  • Pull requests (30d): 26
  • Issues (30d): 6
  • Star history: 141 stars in the last 30 days
