pymupdf4llm  by pymupdf

PDF content extraction and structuring for LLMs

Created 1 year ago
1,099 stars

Top 34.8% on SourcePulse

Project Summary

PyMuPDF4LLM is a specialized Python extension of PyMuPDF, converting PDFs into structured Markdown optimized for Large Language Models (LLMs). It targets developers and researchers, particularly those using Retrieval Augmented Generation (RAG), by making complex PDF content easily digestible and semantically organized.

How It Works

This library uses PyMuPDF to parse PDFs and detect document structure. It identifies headers, paragraphs, tables, and images, and preserves both the document hierarchy and the reading order. Its core advantage is clean, structured Markdown output, which gives an LLM semantically organized text rather than raw extracted text.
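One reason structured Markdown helps in RAG pipelines is that headings give natural split points. The following is a minimal sketch of that idea, not part of pymupdf4llm: `split_by_heading` is a hypothetical helper that takes any Markdown string (such as the output of `to_markdown`) and groups it into sections by ATX-style `#` headings. It deliberately ignores edge cases such as `#` lines inside fenced code blocks.

```python
import re
from typing import List, Tuple

# Matches ATX headings such as "# Title" or "### Sub-section".
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def split_by_heading(md_text: str) -> List[Tuple[str, str]]:
    """Split Markdown into (heading, body) pairs.

    Hypothetical helper for illustration. Text that appears before the
    first heading is returned under an empty heading string.
    """
    sections: List[Tuple[str, str]] = []
    title = ""
    body: List[str] = []

    def flush() -> None:
        # Keep a section if it has a heading or any non-blank body text.
        if title or any(line.strip() for line in body):
            sections.append((title, "\n".join(body).strip()))

    for line in md_text.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            title, body = match.group(2), []
        else:
            body.append(line)
    flush()
    return sections
```

Each returned pair can then be embedded or indexed as its own unit, with the heading kept as context for the body text.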

Quick Start & Requirements

Install via pip:

pip install -U pymupdf4llm

This command also installs PyMuPDF. Usage involves importing the library and calling to_markdown with a PDF file path or PyMuPDF Document object.

import pymupdf4llm
md_text = pymupdf4llm.to_markdown("input.pdf")

Optional parameters include pages, which restricts processing to specific pages, and page_chunks=True, which returns a list of per-page dictionaries instead of a single Markdown string. Multi-column layouts are supported. With write_images=True, images are saved as PNG files and referenced from the Markdown. XPS and eBook formats (e.g., MOBI) are handled through the same interface as PDFs. The README does not link to documentation directly, but the PyMuPDF homepage is on GitHub.
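The options above can be combined in one call. The sketch below wraps `pymupdf4llm.to_markdown` in a hypothetical helper, `convert_pdf`, which is not part of the library. It assumes `pages` takes a list of 0-based page numbers, and it imports the library lazily so the wrapper can also be exercised with a stand-in converter passed through the illustrative `to_markdown` argument.

```python
from typing import Any, Callable, Optional, Sequence


def convert_pdf(
    path: str,
    pages: Optional[Sequence[int]] = None,
    page_chunks: bool = False,
    write_images: bool = False,
    to_markdown: Optional[Callable[..., Any]] = None,
) -> Any:
    """Hypothetical convenience wrapper around pymupdf4llm.to_markdown.

    Returns a Markdown string by default, or a list of per-page
    dictionaries when page_chunks=True.
    """
    if to_markdown is None:
        # Lazy import: only needed when no stand-in converter is supplied.
        import pymupdf4llm

        to_markdown = pymupdf4llm.to_markdown

    kwargs = {"page_chunks": page_chunks, "write_images": write_images}
    if pages is not None:
        # Assumption: page numbers are 0-based.
        kwargs["pages"] = list(pages)
    return to_markdown(path, **kwargs)
```

For example, `convert_pdf("input.pdf", pages=[0, 1], write_images=True)` would convert only the first two pages and save any images it finds, assuming the 0-based convention holds.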

Highlighted Details

  • Converts PDFs to GitHub-compatible Markdown, preserving document hierarchy and semantic structure.
  • Intelligent structure detection for headers, paragraphs, tables, and images.
  • Supports multi-column page processing and page-level text chunk generation.
  • Optional extraction and inline referencing of images.
  • Consistent interface for XPS and eBook formats.
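Page-level chunks are convenient for retrieval because each one can carry its page number as a citation. The sketch below shows one way to flatten them into simple records. `chunks_to_records` is a hypothetical helper, and it assumes each chunk dictionary has a `"text"` key and a `"metadata"` dictionary with a `"page"` entry; if the page entry is missing, it falls back to the chunk's position in the list.

```python
from typing import Any, Dict, Iterable, List


def chunks_to_records(chunks: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Flatten page chunks into {"page", "text"} records, skipping empty pages.

    Hypothetical helper; the "text" and "metadata"/"page" keys are assumed.
    """
    records: List[Dict[str, Any]] = []
    for index, chunk in enumerate(chunks):
        text = chunk.get("text", "").strip()
        if not text:
            continue  # nothing to index for blank pages
        # Fall back to a 1-based position when no page number is recorded.
        page = chunk.get("metadata", {}).get("page", index + 1)
        records.append({"page": page, "text": text})
    return records
```

The input would typically be the list returned by `pymupdf4llm.to_markdown("input.pdf", page_chunks=True)`, and the records can be passed to whatever embedding or indexing step the pipeline uses.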

Maintenance & Community

Maintained by Artifex Software, Inc. (developers of MuPDF/PyMuPDF). Community support is available via the #pymupdf Discord channel.

Licensing & Compatibility

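Available under the open-source AGPL and under commercial licenses. Users who cannot comply with the AGPL must contact Artifex Software for a commercial license. The AGPL is a copyleft license: its terms extend to derivative works, including software that is offered to users over a network.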

Limitations & Caveats

The primary adoption consideration is the AGPL license, potentially restricting use in proprietary applications without a commercial license. No other specific limitations (e.g., alpha status, bugs) are detailed in the README.

Health Check

Last commit: 1 day ago
Responsiveness: Inactive
Pull requests (30d): 1
Issues (30d): 0

Star History

30 stars in the last 30 days

Explore Similar Projects


MinerU by opendatalab

0.9%
48k
PDF extraction tool for converting PDFs to Markdown and JSON
Created 1 year ago
Updated 1 day ago