markdown-crawler by paulpierre

CLI tool for web crawling to markdown conversion, optimized for LLM RAG

created 1 year ago
393 stars

Top 74.3% on sourcepulse

Project Summary

This project provides a multithreaded web crawler that converts each page of a website into its own Markdown file, a compact, human-readable format suited to Retrieval Augmented Generation (RAG) pipelines. It targets developers and researchers who need to process and structure web data efficiently for LLM applications.

How It Works

The crawler recursively follows links within a website, respecting a configurable depth limit and domain constraints. It uses BeautifulSoup to parse HTML and markdownify to convert the parsed content into Markdown, preserving structure such as tables and images. Multithreading speeds up the crawl, while support for resuming interrupted crawls and for URL and HTML validation makes it more robust.
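
The depth limit and domain constraint described above can be illustrated with a small, single-threaded sketch. It is not the project's implementation: it swaps BeautifulSoup for the standard-library `html.parser`, omits Markdown conversion and threading, and reads pages from an in-memory dictionary instead of the network. The names `crawl`, `LinkCollector`, `fetch`, and `SITE` are hypothetical and exist only for this example.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def crawl(start_url, fetch, max_depth=3):
    """Breadth-first crawl limited by depth and restricted to the start domain.

    `fetch` is a hypothetical callable that maps a URL to its HTML, or to
    None when the page is unavailable. Returns the visited URLs in order.
    """
    domain = urlparse(start_url).netloc
    seen = {start_url}
    visited = []
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        if depth >= max_depth:
            continue  # depth limit reached: do not follow links from here
        parser = LinkCollector()
        parser.feed(html)
        for href in parser.hrefs:
            # Resolve relative links and drop "#fragment" suffixes.
            link, _fragment = urldefrag(urljoin(url, href))
            if urlparse(link).netloc == domain and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited


# Hypothetical four-page site plus one off-domain link.
SITE = {
    "https://example.com/": '<a href="/a">A</a> <a href="https://other.example/x">X</a>',
    "https://example.com/a": '<a href="/b#top">B</a> <a href="/">home</a>',
    "https://example.com/b": '<a href="/c">C</a>',
    "https://example.com/c": "<p>leaf</p>",
}

pages = crawl("https://example.com/", SITE.get, max_depth=2)
print(pages)
```

With `max_depth=2`, the page at `/c` is never fetched because it sits three links away from the start page, and the link to `other.example` is skipped by the domain check.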

Quick Start & Requirements

  • Install via pip: pip install markdown-crawler
  • Execute CLI: markdown-crawler -t 5 -d 3 -b ./markdown https://en.wikipedia.org/wiki/Morty_Smith
  • Requirements: Python 3.x, BeautifulSoup4, requests, markdownify.
  • Official docs: https://github.com/paulpierre/markdown-crawler
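
For scripted use, the quick-start command above can be assembled and launched from Python. The sketch below only wraps the CLI invocation shown in this section; the helper names `build_command` and `run_crawl` are hypothetical, and the parameter names `threads`, `depth`, and `out_dir` reflect an assumed reading of the `-t`, `-d`, and `-b` flags that should be checked against the project's documentation.

```python
import shutil
import subprocess


def build_command(url, threads=5, depth=3, out_dir="./markdown"):
    """Assemble the argument list for the quick-start command.

    The mapping of -t, -d, and -b to threads, depth, and output directory
    is an assumption made for this example.
    """
    return [
        "markdown-crawler",
        "-t", str(threads),
        "-d", str(depth),
        "-b", out_dir,
        url,
    ]


def run_crawl(url, **options):
    """Run the CLI if it is on PATH.

    Returns the process exit code, or None when the executable is missing.
    """
    cmd = build_command(url, **options)
    if shutil.which(cmd[0]) is None:
        return None
    return subprocess.run(cmd, check=False).returncode


cmd = build_command("https://en.wikipedia.org/wiki/Morty_Smith")
print(" ".join(cmd))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems when a URL contains characters such as `&` or `?`.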

Highlighted Details

  • Designed for LLM RAG and fine-tuning use cases.
  • Supports resuming crawls, configurable depth, and domain matching.
  • Utilizes BeautifulSoup for parsing and markdownify for conversion.
  • Includes a CLI interface and a Python library for integration.

Maintenance & Community

The project is maintained by @paulpierre. Community activity takes place through GitHub issues and pull requests.

Licensing & Compatibility

The software is released under the MIT License, permitting commercial use and integration with closed-source projects.

Limitations & Caveats

The crawler's effectiveness may depend on the target website's structure and anti-scraping measures. Specific content extraction relies on CSS selectors, which might require adjustment for different site layouts.

Health Check

  • Last commit: 11 months ago
  • Responsiveness: 1 week
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 17 stars in the last 90 days
