MinerU by opendatalab

PDF extraction tool for converting PDFs to Markdown and JSON

Created 1 year ago

51,874 stars

Top 0.5% on SourcePulse

4 Experts Love This Project

Travis Fischer

Founder of Agentic

Chip Huyen

Author of "AI Engineering", "Designing Machine Learning Systems"

Pawel Garbacki

Cofounder of Fireworks AI

Casper Hansen

Author of AutoAWQ

Project Summary

MinerU is an open-source toolkit for high-quality PDF content extraction that converts documents into machine-readable Markdown and JSON. It targets researchers, developers, and other users who need to process scientific literature and complex documents. Its output aims to stay semantically coherent, follow the correct reading order, and preserve the document's structural elements.

How It Works

MinerU runs a modular pipeline with dedicated models for each stage: layout analysis (doclayout_yolo), formula recognition (unimernet), and table extraction (rapid_table, slanet_plus). It replaces the Paddle framework with paddleocr2torch for better compatibility and thread safety. OCR covers 84 languages, and the pipeline handles complex layouts, including multi-column pages and elements that span pages, with the goal of emitting content in human reading order.
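To make "human-readable output order" concrete, here is a minimal sketch of ordering text blocks on a simple two-column page. It is an illustrative heuristic only, not MinerU's algorithm; the `Block` type, the `reading_order` function, and the sample coordinates are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """A text block with a bounding box in page coordinates (origin top-left)."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def reading_order(blocks, page_width):
    """Order blocks for a simple two-column page.

    Hypothetical heuristic (not MinerU's method): assign each block to the
    left or right column by its horizontal center, then read the left column
    top to bottom, followed by the right column.
    """
    mid = page_width / 2

    def key(block):
        center_x = (block.x0 + block.x1) / 2
        column = 0 if center_x < mid else 1
        return (column, block.y0)

    return sorted(blocks, key=key)


# Made-up blocks listed in an arbitrary extraction order.
blocks = [
    Block("right-top", 320, 50, 580, 200),
    Block("left-bottom", 20, 220, 280, 400),
    Block("left-top", 20, 50, 280, 200),
    Block("right-bottom", 320, 220, 580, 400),
]
ordered = [b.text for b in reading_order(blocks, page_width=600)]
print(ordered)
```

A rule this simple breaks on full-width headings, figures that straddle columns, and cross-page elements, which is why a real system needs a layout model rather than a coordinate sort.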

Quick Start & Requirements

Install: pip install -U "magic-pdf[full]"
Prerequisites: Python >= 3.10, CUDA 11.8/12.4/12.6/12.8 (for GPU), or Ascend NPU drivers (for NPU). Minimum 6GB VRAM for GPU acceleration, 8GB RAM for CPU.
Setup: Requires downloading model weights separately.
Demos: Online Demo, HuggingFace, ModelScope, Colab Notebook.
Docs: FAQ, Known Issues
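
As a rough sketch of the steps above, the following script checks the Python prerequisite and then calls the command-line tool that the `magic-pdf` package installs. It assumes the CLI accepts `-p` (input PDF), `-o` (output directory), and `-m` (parsing method), and that model weights are already downloaded; verify both against the project's documentation for your installed version. The helper names and file paths are hypothetical.

```python
import shutil
import subprocess
import sys
from pathlib import Path


def python_ok(version_info=sys.version_info):
    """Return True when the interpreter meets the Python >= 3.10 prerequisite."""
    return tuple(version_info[:2]) >= (3, 10)


def build_command(pdf_path, output_dir, method="auto"):
    """Build the argument list for the magic-pdf CLI.

    The -p/-o/-m flags are an assumption about the installed version.
    """
    return ["magic-pdf", "-p", str(pdf_path), "-o", str(output_dir), "-m", method]


def convert(pdf_path, output_dir):
    """Run the conversion if the environment looks usable; return True on success."""
    if not python_ok():
        print("Python 3.10 or newer is required.")
        return False
    if shutil.which("magic-pdf") is None:
        print('magic-pdf not found; install with: pip install -U "magic-pdf[full]"')
        return False
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    result = subprocess.run(build_command(pdf_path, output_dir))
    return result.returncode == 0


# Example call with placeholder paths:
#   convert("paper.pdf", "output")
```

Going through `subprocess` keeps the sketch independent of the package's Python API, which may change between releases.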

Highlighted Details

Supports GPU (CUDA), NPU (CANN), and MPS (Apple Silicon) acceleration.
Handles complex layouts, including headers, footers, footnotes, page numbers, and multi-column text.
Extracts images, captions, tables, and converts formulas to LaTeX.
Offers batch processing for improved speed, with formula parsing up to 1400% faster than v1.0.1.
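
To see which of these acceleration backends a machine can offer, a small probe like the one below may help. It is a sketch that assumes PyTorch is the runtime in use; the `pick_device` helper is hypothetical, the 6 GB threshold mirrors the VRAM minimum stated in the prerequisites, and NPU detection is left out.

```python
def pick_device(min_vram_gb=6.0):
    """Return "cuda", "mps", or "cpu" based on what PyTorch reports.

    Hypothetical helper, not part of MinerU. Falls back to "cpu" when
    PyTorch is not installed or no suitable accelerator is found.
    """
    try:
        import torch
    except ImportError:
        return "cpu"

    if torch.cuda.is_available():
        # total_memory is in bytes; decimal gigabytes are used so that a
        # nominal 6 GB card is not rejected by rounding.
        total_gb = torch.cuda.get_device_properties(0).total_memory / 1e9
        if total_gb >= min_vram_gb:
            return "cuda"

    # torch.backends.mps is only present in newer PyTorch builds.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"

    return "cpu"


device = pick_device()
print(f"Selected device: {device}")
```

How the chosen device is then passed to MinerU depends on the installed version's configuration mechanism, so that step is deliberately omitted here.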

Maintenance & Community

The project is actively developed with frequent updates, including model upgrades and bug fixes. Community engagement is encouraged via Discord and WeChat.

Licensing & Compatibility

The project uses PyMuPDF, which is AGPL licensed, potentially imposing restrictions on certain usage scenarios. Future plans include exploring more permissive PDF processing libraries.

Limitations & Caveats

Vertical text is not supported. Reading order may be imperfect in extremely complex layouts. Rule-based recognition of tables of contents and lists might miss uncommon formats. Code block recognition is not yet implemented in the layout model. OCR accuracy may vary for less common languages or scripts.

Health Check

Last Commit: 2 days ago
Responsiveness: 1 day

Star History

1,685 stars in the last 30 days