pymupdf
PDF content extraction and structuring for LLMs
Top 34.8% on SourcePulse
PyMuPDF4LLM is a specialized Python extension of PyMuPDF, converting PDFs into structured Markdown optimized for Large Language Models (LLMs). It targets developers and researchers, particularly those using Retrieval Augmented Generation (RAG), by making complex PDF content easily digestible and semantically organized.
How It Works
The library uses PyMuPDF to parse PDFs and detect document structure. It automatically identifies and preserves the document hierarchy (headers, paragraphs, tables, images) and the reading order. Its main advantage is clean, structured Markdown output, which gives an LLM semantically richer input than raw text extraction.
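Because headers come through the conversion as Markdown heading lines, the output can be split into sections before it is indexed for retrieval. The following is a minimal sketch in plain Python and is not part of PyMuPDF4LLM: the helper name split_by_headings and the sample text are hypothetical, and it assumes headings are written in the "#" style.

```python
import re

# Matches "#"-style Markdown headings such as "# Title" or "### Sub-section".
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def split_by_headings(md_text):
    """Split Markdown into (heading, body) pairs.

    Text before the first heading is returned under an empty heading.
    Hypothetical helper for illustration; not part of PyMuPDF4LLM.
    """
    sections = []
    heading, body = "", []
    for line in md_text.splitlines():
        match = HEADING.match(line)
        if match:
            # Close the previous section if it holds any content.
            if heading or any(part.strip() for part in body):
                sections.append((heading, "\n".join(body).strip()))
            heading, body = match.group(2), []
        else:
            body.append(line)
    # Flush the final section.
    if heading or any(part.strip() for part in body):
        sections.append((heading, "\n".join(body).strip()))
    return sections


# Hand-written sample standing in for converted output.
sample = "# Intro\nFirst paragraph.\n\n## Details\nSecond paragraph.\n"
sections = split_by_headings(sample)
```

This sketch treats any line that starts with "#" as a heading, so a comment line inside a fenced code block would be split as well; a production splitter would need to track fences.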
Quick Start & Requirements
Install via pip:
pip install -U pymupdf4llm
This command also installs PyMuPDF. Usage involves importing the library and calling to_markdown with a PDF file path or PyMuPDF Document object.
import pymupdf4llm
md_text = pymupdf4llm.to_markdown("input.pdf")
Optional parameters include pages, which restricts processing to specific pages, and page_chunks=True, which returns page-level text segments as dictionaries instead of a single string. Multi-column layouts are supported. Image extraction (write_images=True) saves images as PNG files and references them in the Markdown. XPS and eBook formats (e.g., MOBI) are handled the same way as PDFs. No direct documentation links are provided; the PyMuPDF homepage is on GitHub.
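The options above might be combined as in the following sketch. The keyword arguments pages, page_chunks, and write_images are the ones named in this section. The 0-based page numbering, the "text" key on each page dictionary, and the two helper names are assumptions that should be checked against the installed version.

```python
def extract_page_chunks(pdf_path, pages=None, write_images=False):
    """Return one dictionary per processed page.

    The import is local so the rest of this sketch runs without the library.
    """
    import pymupdf4llm  # installed by: pip install -U pymupdf4llm

    return pymupdf4llm.to_markdown(
        pdf_path,
        pages=pages,                # e.g. [0, 1]; assumed to be 0-based
        page_chunks=True,           # dictionaries per page, not one string
        write_images=write_images,  # save images as PNGs and reference them
    )


def join_chunk_text(chunks, separator="\n\n"):
    """Concatenate the Markdown of each page chunk.

    Assumes each dictionary stores its Markdown under the "text" key.
    """
    return separator.join(chunk.get("text", "") for chunk in chunks)


# Stand-in data shaped like the assumed output, so no PDF is needed here.
demo_chunks = [{"text": "# Page one"}, {"text": "Body of page two"}]
combined = join_chunk_text(demo_chunks)
```

With a real file, the result of extract_page_chunks("input.pdf", pages=[0, 1]) could be passed straight to join_chunk_text.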
Maintenance & Community
Maintained by Artifex Software, Inc. (developers of MuPDF/PyMuPDF). Community support is available via the #pymupdf Discord channel.
Licensing & Compatibility
Available under the open-source AGPL and under commercial licenses. The AGPL is a copyleft license; users who cannot comply with its terms must contact Artifex Software for a commercial license.
Limitations & Caveats
The main adoption concern is the AGPL license, which can restrict use in proprietary applications unless a commercial license is obtained. The README details no other specific limitations (e.g., alpha status, known bugs).