pymupdf4llm by pymupdf

PDF content extraction and structuring for LLMs

Created 1 year ago

1,325 stars

Top 29.9% on SourcePulse

Project Summary

PyMuPDF4LLM is a Python package built on PyMuPDF that converts PDFs into structured Markdown for use with Large Language Models (LLMs). It targets developers and researchers, particularly those building Retrieval Augmented Generation (RAG) pipelines, who need PDF content in a semantically organized form.

How It Works

The library uses PyMuPDF to parse PDFs and detect document structure. It identifies headers, paragraphs, tables, and images, and preserves their hierarchy and reading order. The output is structured Markdown, which gives an LLM more semantic context than raw extracted text.
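
To illustrate why heading-preserving Markdown is convenient for RAG, here is a minimal sketch that splits Markdown text into sections by heading so each section can be indexed separately. The helper name split_by_headings is hypothetical and not part of pymupdf4llm. The sketch assumes ATX-style headings (lines starting with #) and does not account for headings that appear inside fenced code.

```python
import re

# Hypothetical helper, not part of pymupdf4llm: split Markdown into
# (heading, body) pairs so each section can be embedded on its own.
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_headings(md_text):
    sections = []
    title, body = None, []

    def flush():
        # Record the current section if it has a heading or any text.
        if title is not None or any(line.strip() for line in body):
            sections.append((title, "\n".join(body).strip()))

    for line in md_text.splitlines():
        match = HEADING.match(line)
        if match:
            flush()  # close the previous section
            title, body = match.group(2).strip(), []
        else:
            body.append(line)
    flush()  # close the final section
    return sections


sample = "# Intro\nHello.\n\n## Details\nMore text.\n"
sections = split_by_headings(sample)
```

Text that appears before the first heading is returned with None as its heading, so no content is dropped.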

Quick Start & Requirements

Install via pip:

pip install -U pymupdf4llm

This command also installs PyMuPDF. To use the library, import it and call to_markdown with either a PDF file path or a PyMuPDF Document object.

import pymupdf4llm
md_text = pymupdf4llm.to_markdown("input.pdf")

Optional parameters include pages, which restricts processing to specific pages, and page_chunks=True, which returns page-level text segments as dictionaries instead of a single string. Multi-column layouts are supported. With write_images=True, images are saved as PNG files and referenced in the Markdown. XPS and eBook formats (e.g., MOBI) are handled through the same interface as PDF. The README does not link to documentation directly, but the PyMuPDF homepage is on GitHub.
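
The optional parameters can be combined roughly as in the sketch below. The function names extract_page_chunks and chunks_to_records are hypothetical. The sketch also assumes details not stated above: that pages takes 0-based page numbers, that an image_path argument selects the output folder for images, and that each page dictionary carries "text" and "metadata" keys with a "page" entry. Check these against the installed version before relying on them.

```python
from pathlib import Path


def extract_page_chunks(pdf_path, pages=None, image_dir=None):
    """Return page-level dictionaries using page_chunks=True."""
    # Imported here so chunks_to_records stays usable without the library.
    import pymupdf4llm

    kwargs = {"page_chunks": True}
    if pages is not None:
        kwargs["pages"] = list(pages)  # assumed to be 0-based page numbers
    if image_dir is not None:
        Path(image_dir).mkdir(parents=True, exist_ok=True)
        kwargs["write_images"] = True
        kwargs["image_path"] = str(image_dir)  # assumed parameter name
    return pymupdf4llm.to_markdown(str(pdf_path), **kwargs)


def chunks_to_records(chunks):
    """Flatten page dictionaries into small records for a RAG index."""
    records = []
    for chunk in chunks:
        text = chunk.get("text", "").strip()
        if not text:
            continue  # skip pages with no extracted text
        meta = chunk.get("metadata", {})
        records.append({"page": meta.get("page"), "text": text})
    return records


# Example usage (requires pymupdf4llm and a local input.pdf):
# chunks = extract_page_chunks("input.pdf", pages=[0, 1], image_dir="images")
# records = chunks_to_records(chunks)
```

Keeping the flattening step separate from extraction means the record format can change without touching the conversion call.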

Highlighted Details

Converts PDFs to GitHub-compatible Markdown, preserving document hierarchy and semantic structure.
Intelligent structure detection for headers, paragraphs, tables, and images.
Supports multi-column page processing and page-level text chunk generation.
Optional extraction and inline referencing of images.
Consistent interface for XPS and eBook formats.

Maintenance & Community

Maintained by Artifex Software, Inc. (developers of MuPDF/PyMuPDF). Community support is available via the #pymupdf Discord channel.

Licensing & Compatibility

Available under the open-source AGPL and under commercial licenses. Users who cannot comply with the AGPL must contact Artifex Software for a commercial license. As a copyleft license, the AGPL places source-disclosure obligations on software that incorporates the library.

Limitations & Caveats

The primary adoption consideration is the AGPL license, potentially restricting use in proprietary applications without a commercial license. No other specific limitations (e.g., alpha status, bugs) are detailed in the README.

Explore Similar Projects

llmdocparser by lazyFrogLOL

dom-to-semantic-markdown by romansky

documind by DocumindHQ

vision-parse by iamarunbrahma

paperless-gpt by icereed

OpenContracts by Open-Source-Legal

ExtractThinker by enoch3712

llmsherpa by nlmatics

nlm-ingestor by nlmatics

pdf-craft by oomol-lab

MinerU by opendatalab

ragflow by infiniflow