markdown-crawler  by paulpierre

CLI tool that crawls websites and converts pages to Markdown, optimized for LLM RAG

Created 1 year ago
402 stars

Top 72.1% on SourcePulse

Project Summary

This project provides a multithreaded web crawler that converts website content into individual Markdown files, optimized for Retrieval Augmented Generation (RAG) pipelines. It targets developers and researchers who need to process and structure web data for LLM applications, and it produces output that is compact and human-readable.

How It Works

The crawler recursively navigates a website, respecting configurable depth limits and domain constraints. It leverages BeautifulSoup for HTML parsing and markdownify to convert HTML content into Markdown, preserving structure like tables and images. The multithreaded design accelerates the crawling process, and features like resuming interrupted crawls and URL/HTML validation enhance robustness.
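The depth limit and domain constraint described above can be illustrated with a small, standard-library-only sketch. This is not the project's actual implementation: the `crawl` function, the `LinkCollector` class, the `fetch` callable, and the in-memory `SITE` dictionary are all hypothetical names introduced here, and the sketch leaves out the Markdown conversion, multithreading, and resume logic entirely.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def crawl(start_url, fetch, max_depth=3):
    """Breadth-first crawl limited by depth and restricted to the start domain.

    `fetch` is a hypothetical callable mapping a URL to HTML text (or None).
    Returns the visited URLs in the order they were fetched.
    """
    domain = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([(start_url, 0)])
    visited = []
    while queue:
        url, depth = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        if depth >= max_depth:
            continue  # do not follow links beyond the depth limit
        parser = LinkCollector()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop fragments before comparing.
            link = urljoin(url, href).split("#")[0]
            if urlparse(link).netloc == domain and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited


# Tiny in-memory "site" so the sketch runs without network access.
SITE = {
    "https://example.com/": '<a href="/a">A</a> <a href="https://other.org/">X</a>',
    "https://example.com/a": '<a href="/b">B</a> <a href="/">home</a>',
    "https://example.com/b": '<a href="/c">C</a>',
    "https://example.com/c": "<p>leaf</p>",
}

pages = crawl("https://example.com/", SITE.get, max_depth=2)
print(pages)
```

With `max_depth=2`, the page at `/c` is never fetched because it sits three links away from the start URL, and the link to `other.org` is skipped by the domain check.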

Quick Start & Requirements

  • Install via pip: pip install markdown-crawler
  • Execute CLI: markdown-crawler -t 5 -d 3 -b ./markdown https://en.wikipedia.org/wiki/Morty_Smith
  • Requirements: Python 3.x, BeautifulSoup4, requests, markdownify.
  • Official docs: https://github.com/paulpierre/markdown-crawler
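For scripted use, the CLI invocation above can be assembled as an argument list and passed to `subprocess`. The sketch below is an illustration only: `build_crawl_command` is a hypothetical helper, and reading `-t` as the thread count, `-d` as the maximum depth, and `-b` as the output directory is an assumption inferred from the example command and the project description, so check the official docs before relying on it.

```python
import shlex


def build_crawl_command(url, threads=5, depth=3, base_dir="./markdown"):
    """Return the argv list for the CLI invocation shown in Quick Start."""
    return [
        "markdown-crawler",
        "-t", str(threads),  # assumed: number of worker threads
        "-d", str(depth),    # assumed: maximum crawl depth
        "-b", base_dir,      # assumed: directory for the Markdown output
        url,
    ]


cmd = build_crawl_command("https://en.wikipedia.org/wiki/Morty_Smith")
print(shlex.join(cmd))

# To actually run the crawl (requires the package to be installed):
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing a list rather than a single shell string avoids quoting problems when the URL contains characters such as `&` or `?`.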

Highlighted Details

  • Designed for LLM RAG and fine-tuning use cases.
  • Supports resuming crawls, configurable depth, and domain matching.
  • Utilizes BeautifulSoup for parsing and markdownify for conversion.
  • Includes a CLI interface and a Python library for integration.

Maintenance & Community

The project is maintained by @paulpierre. Community activity takes place through GitHub issues and pull requests.

Licensing & Compatibility

The software is released under the MIT License, permitting commercial use and integration with closed-source projects.

Limitations & Caveats

The crawler's effectiveness depends on the target website's structure and on any anti-scraping measures it uses. Content extraction relies on CSS selectors, which may need adjusting for different site layouts.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 10 stars in the last 30 days
