markdown-crawler by paulpierre

CLI tool that crawls websites and converts pages to Markdown, optimized for LLM RAG

Created 2 years ago

422 stars

Top 69.7% on SourcePulse

Project Summary

This project provides a multithreaded web crawler that converts website content into individual Markdown files, optimized for Retrieval Augmented Generation (RAG) pipelines. It targets developers and researchers who need to process and structure web data efficiently for LLM applications, and it produces output that is both human-readable and compact.

How It Works

The crawler recursively follows links within a website, respecting configurable depth limits and domain constraints. It parses HTML with BeautifulSoup and converts it to Markdown with markdownify, preserving structure such as tables and images. Multithreading speeds up the crawl, while support for resuming interrupted crawls and URL/HTML validation improves robustness.
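
The depth-limited, domain-restricted traversal described above can be illustrated with a minimal sketch. This is not the project's actual code: it uses an iterative breadth-first queue rather than recursion, it swaps in the standard library's html.parser as a stand-in for BeautifulSoup so that it runs without extra dependencies, and the names crawl, LinkCollector, and fetch are hypothetical. The exact meaning of the depth limit in the real tool is also an assumption here.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects href values from <a> tags (stand-in for BeautifulSoup)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def crawl(start_url, fetch, max_depth=3):
    """Breadth-first crawl limited by depth and restricted to the start domain.

    `fetch` is a hypothetical callable mapping a URL to HTML text, or None
    when the page cannot be retrieved. Returns a dict of URL -> HTML.
    """
    domain = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([(start_url, 0)])
    pages = {}
    while queue:
        url, depth = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        if depth >= max_depth:
            continue  # depth limit reached: keep the page, do not expand it
        parser = LinkCollector()
        parser.feed(html)
        for href in parser.hrefs:
            # Resolve relative links and drop #fragments before comparing.
            link, _ = urldefrag(urljoin(url, href))
            if urlparse(link).netloc != domain or link in seen:
                continue  # off-domain or already queued
            seen.add(link)
            queue.append((link, depth + 1))
    return pages
```

In a full pipeline, each HTML string in the returned dict would then be passed to a converter such as markdownify and written to its own Markdown file. Passing fetch in as a parameter keeps the traversal logic testable without network access.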

Quick Start & Requirements

Install via pip: pip install markdown-crawler
Execute CLI: markdown-crawler -t 5 -d 3 -b ./markdown https://en.wikipedia.org/wiki/Morty_Smith
Requirements: Python 3.x, BeautifulSoup4, requests, markdownify.
Official docs: https://github.com/paulpierre/markdown-crawler

Highlighted Details

Designed for LLM RAG and fine-tuning use cases.
Supports resuming crawls, configurable depth, and domain matching.
Utilizes BeautifulSoup for parsing and markdownify for conversion.
Includes a CLI interface and a Python library for integration.

Maintenance & Community

The project is maintained by @paulpierre. Community participation takes place through GitHub issues and pull requests.

Licensing & Compatibility

The software is released under the MIT License, permitting commercial use and integration with closed-source projects.

Limitations & Caveats

The crawler's effectiveness may depend on the target website's structure and anti-scraping measures. Specific content extraction relies on CSS selectors, which might require adjustment for different site layouts.

Explore Similar Projects

create-llmstxt-py by firecrawl

Craw4LLM by cxcscmu

doctor by sisig-ai

llmstxt-generator by firecrawl

ii-researcher by Intelligent-Internet

parsera by raznem

tavily-python by tavily-ai

sitefetch by egoist

tap4-ai-crawler by 6677-ai

AnyCrawl by any4ai

trafilatura by adbar

firecrawl by firecrawl