PyMuPDF by pymupdf

High-performance PDF toolkit for data extraction and AI pipelines

Created 14 years ago

10,289 stars

Top 5.1% on SourcePulse

Project Summary

PyMuPDF is a high-performance Python library for comprehensive PDF (and other document) manipulation, data extraction, and conversion. Built upon the efficient MuPDF C engine, it offers developers precise, low-level control alongside convenient high-level APIs. It is particularly beneficial for AI pipelines and RAG systems requiring robust, local document processing without mandatory external dependencies.

How It Works

PyMuPDF leverages MuPDF, a lightweight and fast C rendering engine, to achieve its performance. This C-based foundation allows for rapid processing of PDF documents, including text extraction with detailed metadata, page rendering, image extraction, and manipulation. The library provides a direct interface to MuPDF's capabilities, enabling efficient operations like annotation, redaction, merging, splitting, and format conversion.
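As a minimal sketch of text extraction with detailed metadata, the example below flattens the nested structure returned by page.get_text("dict") into one record per text span. The helper names iter_text_spans and extract_spans are hypothetical and chosen for illustration only; the pymupdf import is done lazily inside extract_spans so the flattening helper can be read and tested on its own.

```python
def iter_text_spans(page_dict):
    """Yield one record per text span from a page.get_text("dict") result."""
    for block in page_dict.get("blocks", []):
        if block.get("type") != 0:  # 0 = text block, 1 = image block
            continue
        for line in block.get("lines", []):
            for span in line.get("spans", []):
                yield {
                    "text": span["text"],
                    "font": span["font"],
                    "size": span["size"],
                    "bbox": tuple(span["bbox"]),
                }


def extract_spans(path):
    """Collect span records for every page of the document at `path`."""
    import pymupdf  # lazy import: only needed when a real file is opened

    spans = []
    with pymupdf.open(path) as doc:
        for page in doc:
            for span in iter_text_spans(page.get_text("dict")):
                span["page"] = page.number  # 0-based page index
                spans.append(span)
    return spans
```

Calling extract_spans on a PDF path (for example a hypothetical "report.pdf") would return a list of dictionaries carrying the text, font name, font size, bounding box, and page index of each span.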

Quick Start & Requirements

Installation: pip install pymupdf
Prerequisites: Wheels are available for Python 3.10–3.14 on Windows, macOS, and Linux. Compilation from source requires a C/C++ toolchain.
Optional Packages:
- pymupdf-fonts: Extended font collection.
- pymupdf4llm: LLM/RAG-optimised Markdown and JSON extraction.
- pymupdfpro: Adds support for Office (DOC, DOCX, XLS, XLSX, PPT, PPTX) and Korean Office (HWP) documents (requires a license key for full functionality).
- tesseract-ocr: Required for OCR on scanned pages; must be installed separately (e.g., sudo apt install tesseract-ocr on Debian/Ubuntu).
Documentation: Full details available at pymupdf.readthedocs.io.

Highlighted Details

Performance: Reported benchmarks indicate 10–50x faster text extraction and over 100x faster page rendering than pure-Python libraries, with minimal memory usage.
LLM/RAG Integration: pymupdf4llm provides native Markdown and JSON output, optimized for AI and RAG pipelines, processing documents efficiently without requiring a GPU.
Versatility: Supports a wide range of input formats including PDF, XPS, EPUB, CBZ, MOBI, FB2, SVG, TXT, and various image types. Can convert Office documents to PDF (Pro version).
Local Execution: Operates entirely locally with no cloud dependencies, making it suitable for regulated industries, on-premise deployments, and air-gapped systems.
Features: Includes text/table extraction, image handling, annotation, redaction, form filling, PDF editing, encryption, and drawing capabilities.
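To illustrate the LLM/RAG integration, here is a rough sketch that converts a PDF to Markdown with pymupdf4llm.to_markdown and then splits the result into heading-delimited chunks for indexing. The splitting strategy and the helper names pdf_to_markdown and split_by_heading are assumptions made for this example, not part of the library; pymupdf4llm is imported lazily so the pure-Python splitter can be used independently.

```python
import re


def pdf_to_markdown(path):
    """Convert the document at `path` to a single Markdown string."""
    import pymupdf4llm  # lazy import: optional package, installed separately

    return pymupdf4llm.to_markdown(path)


def split_by_heading(markdown):
    """Split Markdown text into chunks, starting a new chunk at each heading line.

    Simplification: lines starting with '#' inside fenced code blocks are
    also treated as headings.
    """
    chunks, current = [], []
    for line in markdown.splitlines():
        if re.match(r"#{1,6}\s", line) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return [chunk for chunk in chunks if chunk]
```

A pipeline could then pass split_by_heading(pdf_to_markdown(path)) to an embedding step; real systems often also enforce a maximum chunk size, which this sketch omits.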

Maintenance & Community

The project is actively maintained, with a Discord community available for support. Contributions are welcomed, with a recommendation to open an issue before submitting large pull requests.

Licensing & Compatibility

PyMuPDF is licensed under GNU AGPL v3 for open-source use, which requires derivative works, including those offered to users over a network, to be released under the same license. Commercial licenses are available from Artifex Software, Inc. for proprietary applications that cannot accept the AGPL's copyleft terms.

Limitations & Caveats

PyMuPDF does not support multithreading; use multiprocessing for parallel workloads instead. OCR requires a separately installed and configured Tesseract. Full support for Office and HWP documents via pymupdfpro requires a commercial license key, and evaluation versions have page and time limits. PDFs with custom font encodings may require OCR for accurate text extraction.
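One way to follow the multiprocessing recommendation is sketched below: the page range is divided into contiguous slices, and each worker process opens its own copy of the document, so no document object is shared between processes. The function names split_page_ranges, extract_range, and extract_text_parallel are hypothetical, and the default of four workers is an arbitrary assumption.

```python
from concurrent.futures import ProcessPoolExecutor


def split_page_ranges(page_count, workers):
    """Split range(page_count) into at most `workers` contiguous (start, stop) pairs."""
    if page_count <= 0:
        return []
    workers = max(1, min(workers, page_count))
    base, extra = divmod(page_count, workers)
    ranges, start = [], 0
    for i in range(workers):
        stop = start + base + (1 if i < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges


def extract_range(task):
    """Worker: open the file independently and extract text for one page slice."""
    import pymupdf  # lazy import: each worker process loads the library itself

    path, start, stop = task
    with pymupdf.open(path) as doc:
        return [doc[i].get_text() for i in range(start, stop)]


def extract_text_parallel(path, workers=4):
    """Return the plain text of every page, in page order, using worker processes."""
    import pymupdf

    with pymupdf.open(path) as doc:
        page_count = doc.page_count
    tasks = [(path, start, stop) for start, stop in split_page_ranges(page_count, workers)]
    if not tasks:
        return []
    with ProcessPoolExecutor(max_workers=len(tasks)) as pool:
        # map() preserves task order, so pages come back in document order
        return [text for chunk in pool.map(extract_range, tasks) for text in chunk]
```

On platforms that start worker processes by spawning (Windows and macOS by default), the call to extract_text_parallel should sit under an if __name__ == "__main__": guard.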

Health Check

Last Commit

4 days ago

Responsiveness

Inactive

Star History

203 stars in the last 30 days