pymupdf
High-performance PDF toolkit for data extraction and AI pipelines
Top 5.5% on SourcePulse
PyMuPDF is a high-performance Python library for comprehensive PDF (and other document) manipulation, data extraction, and conversion. Built upon the efficient MuPDF C engine, it offers developers precise, low-level control alongside convenient high-level APIs. It is particularly beneficial for AI pipelines and RAG systems requiring robust, local document processing without mandatory external dependencies.
How It Works
PyMuPDF leverages MuPDF, a lightweight and fast C rendering engine, to achieve its performance. This C-based foundation allows for rapid processing of PDF documents, including text extraction with detailed metadata, page rendering, image extraction, and manipulation. The library provides a direct interface to MuPDF's capabilities, enabling efficient operations like annotation, redaction, merging, splitting, and format conversion.
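As a rough illustration of "text extraction with detailed metadata", the sketch below flattens the nested dictionary that `page.get_text("dict")` returns (blocks, then lines, then spans) into one record per text span. The helper names `flatten_spans` and `extract_spans` are hypothetical and not part of PyMuPDF, and the sketch assumes a recent release where the package can be imported as `pymupdf`.

```python
from typing import Any


def flatten_spans(page_dict: dict[str, Any]) -> list[dict[str, Any]]:
    """Flatten the blocks -> lines -> spans nesting into one record per span.

    `page_dict` is expected to have the shape returned by page.get_text("dict").
    (Hypothetical helper, not a PyMuPDF function.)
    """
    records = []
    for block in page_dict.get("blocks", []):
        if block.get("type") != 0:  # 0 = text block, 1 = image block
            continue
        for line in block.get("lines", []):
            for span in line.get("spans", []):
                records.append(
                    {
                        "text": span["text"],
                        "font": span["font"],
                        "size": span["size"],
                        "bbox": tuple(span["bbox"]),  # (x0, y0, x1, y1)
                    }
                )
    return records


def extract_spans(path: str) -> list[dict[str, Any]]:
    """Open a document and collect span records for every page.

    (Hypothetical helper; requires `pip install pymupdf`.)
    """
    import pymupdf  # imported lazily so flatten_spans works without it

    records = []
    with pymupdf.open(path) as doc:
        for page in doc:
            for record in flatten_spans(page.get_text("dict")):
                record["page"] = page.number  # 0-based page index
                records.append(record)
    return records
```

Calling `extract_spans("report.pdf")` (the file name is a placeholder) would return a list of dictionaries carrying the text, font name, font size, bounding box, and page index of each span, which is one convenient shape for downstream filtering or chunking.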
Quick Start & Requirements
pip install pymupdf
pymupdf-fonts: Extended font collection.
pymupdf4llm: LLM/RAG-optimised Markdown and JSON extraction.
pymupdfpro: Adds Office (DOC, DOCX, XLS, XLSX, PPT, PPTX) and Korean Office (HWP) document support (requires license key for full functionality).
tesseract-ocr: Required for OCR on scanned pages; must be installed separately (e.g., sudo apt install tesseract-ocr on Debian/Ubuntu).
Highlighted Details
pymupdf4llm provides native Markdown and JSON output, optimized for AI and RAG pipelines, processing documents efficiently without requiring a GPU.
Maintenance & Community
The project is actively maintained, with a Discord community available for support. Contributions are welcomed, with a recommendation to open an issue before submitting large pull requests.
Licensing & Compatibility
PyMuPDF is licensed under GNU AGPL v3 for open-source use, requiring derivative works to be shared under the same license. Commercial licenses are available from Artifex Software, Inc. for proprietary applications, offering broader compatibility without the copyleft restrictions of AGPL.
Limitations & Caveats
PyMuPDF does not support multithreading; use multiprocessing for parallel work instead. OCR requires Tesseract to be installed and configured separately. Full support for Office and HWP documents via pymupdfpro requires a commercial license key; evaluation versions have page and time limits. PDFs with custom font encodings may require OCR for accurate text extraction.
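One way to follow the multiprocessing advice is to split the page range into contiguous chunks and let each worker process open its own document handle. The following is a minimal sketch under that assumption; `split_page_ranges`, `extract_range`, and `extract_text_parallel` are hypothetical helper names, not PyMuPDF functions.

```python
from concurrent.futures import ProcessPoolExecutor


def split_page_ranges(page_count: int, workers: int) -> list[tuple[int, int]]:
    """Split range(page_count) into at most `workers` contiguous (start, stop) chunks."""
    if page_count <= 0:
        return []
    workers = max(1, min(workers, page_count))
    base, extra = divmod(page_count, workers)
    ranges, start = [], 0
    for i in range(workers):
        # The first `extra` chunks get one additional page each.
        stop = start + base + (1 if i < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges


def extract_range(job: tuple[str, int, int]) -> list[str]:
    """Worker: open a private Document handle and extract pages [start, stop)."""
    import pymupdf  # imported inside the worker process

    path, start, stop = job
    with pymupdf.open(path) as doc:
        return [doc[i].get_text() for i in range(start, stop)]


def extract_text_parallel(path: str, workers: int = 4) -> list[str]:
    """Return the plain text of every page, extracted by separate processes."""
    import pymupdf

    with pymupdf.open(path) as doc:
        page_count = doc.page_count
    jobs = [(path, start, stop) for start, stop in split_page_ranges(page_count, workers)]
    if not jobs:
        return []
    with ProcessPoolExecutor(max_workers=len(jobs)) as pool:
        # pool.map preserves job order, so pages come back in document order.
        return [text for chunk in pool.map(extract_range, jobs) for text in chunk]
```

Document objects are not shared between processes here; only the file path and page indices are passed to the workers. On platforms that spawn rather than fork worker processes, `extract_text_parallel` should be called from inside an `if __name__ == "__main__":` guard.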