CLI tool that crawls websites and converts pages to Markdown, optimized for LLM RAG
This project provides a multithreaded web crawler that converts website content into individual Markdown files, a human-readable and compact format well suited to Retrieval Augmented Generation (RAG) pipelines. It targets developers and researchers who need to efficiently process and structure web data for LLM applications.
How It Works
The crawler recursively navigates a website, respecting configurable depth limits and domain constraints. It leverages BeautifulSoup for HTML parsing and markdownify to convert HTML content into Markdown, preserving structure like tables and images. The multithreaded design accelerates the crawling process, and features like resuming interrupted crawls and URL/HTML validation enhance robustness.
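The depth and same-domain constraints described above can be sketched as a simple link filter. This is an illustrative simplification using only the standard library, not the project's actual implementation; the function name and parameters are hypothetical.

```python
from urllib.parse import urljoin, urlparse

def should_crawl(link: str, base_url: str, depth: int,
                 max_depth: int, seen: set) -> bool:
    """Decide whether a discovered link should be enqueued.

    Illustrative sketch of the crawler's depth limit, same-domain
    constraint, and URL validation -- not the real implementation.
    """
    if depth >= max_depth:
        return False                            # respect the depth limit
    absolute = urljoin(base_url, link)          # resolve relative links
    parsed = urlparse(absolute)
    if parsed.scheme not in ("http", "https"):  # skip mailto:, javascript:, etc.
        return False
    if parsed.netloc != urlparse(base_url).netloc:
        return False                            # stay within the start domain
    if absolute in seen:
        return False                            # avoid re-crawling a page
    seen.add(absolute)
    return True
```

A real crawler would layer HTML validation and politeness (rate limiting, robots.txt) on top of a filter like this.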
Quick Start & Requirements

```shell
pip install markdown-crawler
markdown-crawler -t 5 -d 3 -b ./markdown https://en.wikipedia.org/wiki/Morty_Smith
```

Here `-t` sets the number of crawler threads, `-d` the maximum crawl depth, and `-b` the output directory for the generated Markdown files.
Maintenance & Community
The project is maintained by @paulpierre. Community engagement happens through GitHub issues and pull requests.
Licensing & Compatibility
The software is released under the MIT License, permitting commercial use and integration with closed-source projects.
Limitations & Caveats
The crawler's effectiveness depends on the target website's structure and any anti-scraping measures it employs. Content extraction relies on CSS selectors, which may require adjustment for different site layouts.