PDF extraction tool for converting PDFs to Markdown and JSON
MinerU is an open-source toolkit for high-quality PDF content extraction that converts documents into machine-readable Markdown and JSON. It targets researchers, developers, and other users who process scientific literature and complex documents, and it aims to keep the output semantically coherent, preserve an accurate reading order, and extract structural elements.
How It Works
MinerU runs a modular pipeline with a dedicated model for each stage: doclayout_yolo for layout analysis, unimernet for formula recognition, and rapid_table and slanet_plus for table extraction. It replaces the Paddle framework with paddleocr2torch for improved compatibility and thread safety. The system supports OCR for 84 languages and handles complex layouts, including multi-column pages and elements that span pages, with the goal of emitting content in the order a human would read it.
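To make the idea of a modular pipeline concrete, here is a minimal sketch of stages chained over a list of page blocks. It is not MinerU's code: the `Block` type, the stage functions, and the two-column ordering rule are all hypothetical stand-ins chosen for illustration, and the real models (doclayout_yolo, unimernet, and so on) are replaced by trivial placeholders.

```python
from dataclasses import dataclass, replace
from typing import Callable, List


@dataclass(frozen=True)
class Block:
    """Hypothetical page element; not a MinerU type."""
    kind: str      # "text", "formula", or "table"
    x: float       # left edge of the block on the page
    y: float       # top edge of the block on the page
    content: str = ""


# A stage takes the current blocks and returns a new list of blocks.
Stage = Callable[[List[Block]], List[Block]]


def recognize_formulas(blocks: List[Block]) -> List[Block]:
    # Placeholder for a formula-recognition model: wrap formula text
    # in display-math delimiters and leave other blocks untouched.
    return [
        replace(b, content=f"$${b.content}$$") if b.kind == "formula" else b
        for b in blocks
    ]


def order_two_columns(blocks: List[Block], page_width: float = 600.0) -> List[Block]:
    # Toy reading order (an assumption, not MinerU's algorithm):
    # read the left column top to bottom, then the right column.
    middle = page_width / 2
    return sorted(blocks, key=lambda b: (b.x >= middle, b.y))


def run_pipeline(blocks: List[Block], stages: List[Stage]) -> List[Block]:
    # Feed the output of each stage into the next one.
    for stage in stages:
        blocks = stage(blocks)
    return blocks


sample = [
    Block("text", x=320, y=50, content="right top"),
    Block("formula", x=20, y=200, content="E=mc^2"),
    Block("text", x=20, y=50, content="left top"),
]
ordered = run_pipeline(sample, [recognize_formulas, order_two_columns])
print([b.content for b in ordered])
```

Because each stage shares the same signature, a stage can be swapped or removed without touching the others, which is the property a modular design is meant to provide.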
Quick Start & Requirements
pip install -U "magic-pdf[full]"
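After installation, the package provides a `magic-pdf` command-line entry point. The sketch below builds an invocation from Python. The `-p` (input PDF), `-o` (output directory), and `-m` (method: `auto`, `ocr`, or `txt`) flags are assumptions based on the 1.x command-line interface and should be checked against `magic-pdf --help` for the installed version. The helper name `build_magic_pdf_command` and the file paths are hypothetical.

```python
import shlex
from typing import List

# Assumed set of parsing methods; verify with `magic-pdf --help`.
ALLOWED_METHODS = {"auto", "ocr", "txt"}


def build_magic_pdf_command(pdf_path: str, output_dir: str, method: str = "auto") -> List[str]:
    """Return the argument list for a magic-pdf run (hypothetical helper)."""
    if method not in ALLOWED_METHODS:
        raise ValueError(f"unknown method: {method!r}")
    return ["magic-pdf", "-p", pdf_path, "-o", output_dir, "-m", method]


command = build_magic_pdf_command("paper.pdf", "output")
# Show the command as it would be typed in a shell.
print(shlex.join(command))

# To actually run it (requires magic-pdf to be installed):
# import subprocess
# subprocess.run(command, check=True)
```

Passing the arguments as a list rather than a single string avoids shell quoting problems when paths contain spaces.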
Highlighted Details
Maintenance & Community
The project is actively developed with frequent updates, including model upgrades and bug fixes. Community engagement is encouraged via Discord and WeChat.
Licensing & Compatibility
The project depends on PyMuPDF, which is AGPL-licensed and may therefore impose restrictions in some usage scenarios. Future plans include exploring more permissively licensed PDF processing libraries.
Limitations & Caveats
Vertical text is not supported. Reading order may be imperfect in extremely complex layouts. Rule-based recognition of tables of contents and lists might miss uncommon formats. Code block recognition is not yet implemented in the layout model. OCR accuracy may vary for less common languages or scripts.