MinerU by opendatalab

PDF extraction tool for converting PDFs to Markdown and JSON

created 1 year ago
41,013 stars

Top 0.7% on sourcepulse

Project Summary

MinerU is an open-source toolkit for high-quality PDF content extraction that converts documents into machine-readable Markdown and JSON. It targets researchers, developers, and other users who need to process scientific literature and complex documents. It aims to preserve semantic coherence and reading order while extracting structural elements such as images, tables, and formulas.

How It Works

MinerU uses a modular pipeline with dedicated models for layout analysis (doclayout_yolo), formula recognition (unimernet), and table extraction (rapid_table, slanet_plus). OCR runs through paddleocr2torch instead of the Paddle framework, which improves compatibility and thread safety. The system supports OCR for 84 languages and handles complex layouts, including multi-column pages and elements that span pages, with the goal of emitting content in the order a human would read it.
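To make the reading-order goal concrete, here is a minimal sketch of one naive way to order layout blocks on a multi-column page. It is not MinerU's algorithm: the `Block` type, its fields, and the `reading_order` function are hypothetical names introduced for illustration. The sketch assumes equal-width columns and ignores full-width and cross-page elements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """Hypothetical layout block; coordinates use a top-left origin."""
    kind: str    # illustrative label such as "text", "table", "formula"
    x0: float    # left edge
    y0: float    # top edge (y grows downward)
    x1: float    # right edge
    text: str


def reading_order(blocks, page_width, n_columns=2):
    """Order blocks column by column, top to bottom within each column.

    Assumption: the page has n_columns equal-width columns and no block
    spans more than one column.
    """
    col_width = page_width / n_columns

    def sort_key(block):
        center_x = (block.x0 + block.x1) / 2
        # Clamp so a block touching the right margin stays in the last column.
        column = min(n_columns - 1, int(center_x // col_width))
        return (column, block.y0)

    return sorted(blocks, key=sort_key)


blocks = [
    Block("text", 320, 50, 580, "right-top"),
    Block("text", 20, 400, 280, "left-bottom"),
    Block("text", 20, 50, 280, "left-top"),
]
ordered = [b.text for b in reading_order(blocks, page_width=600)]
```

A real pipeline has to cope with headings that span columns, floats, and footnotes, which is why the summary describes reading order as a goal and not a guarantee.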

Quick Start & Requirements

  • Install: pip install -U "magic-pdf[full]"
  • Prerequisites: Python >= 3.10, CUDA 11.8/12.4/12.6/12.8 (for GPU), or Ascend NPU drivers (for NPU). Minimum 6GB VRAM for GPU acceleration, 8GB RAM for CPU.
  • Setup: Requires downloading model weights separately.
  • Demos: Online Demo, HuggingFace, ModelScope, Colab Notebook.
  • Docs: FAQ, Known Issues
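
The requirements above can be checked before running a conversion. The sketch below is an illustration and not part of the project. The helper names (`python_ok`, `choose_mode`, `build_command`, `convert`) are hypothetical, and the `magic-pdf -p ... -o ... -m auto` flags are assumed from the project README at the time of writing, so confirm them with `magic-pdf --help` on the installed version.

```python
import shutil
import subprocess
import sys

MIN_PYTHON = (3, 10)  # Python >= 3.10, from the prerequisites
MIN_VRAM_GB = 6       # minimum VRAM for GPU acceleration
MIN_RAM_GB = 8        # minimum RAM for CPU-only runs


def python_ok(version=None):
    """Return True if the given (or current) Python version is new enough."""
    if version is None:
        version = sys.version_info
    return tuple(version[:2]) >= MIN_PYTHON


def choose_mode(vram_gb, ram_gb):
    """Hypothetical helper: return 'gpu', 'cpu', or None from memory sizes.

    Pass vram_gb=None when no GPU is present.
    """
    if vram_gb is not None and vram_gb >= MIN_VRAM_GB:
        return "gpu"
    if ram_gb >= MIN_RAM_GB:
        return "cpu"
    return None


def build_command(pdf_path, out_dir, method="auto"):
    """Build the CLI call; the flags are an assumption taken from the README."""
    return ["magic-pdf", "-p", str(pdf_path), "-o", str(out_dir), "-m", method]


def convert(pdf_path, out_dir):
    """Run the conversion, failing early if a prerequisite is missing."""
    if not python_ok():
        raise RuntimeError("Python >= 3.10 is required")
    if shutil.which("magic-pdf") is None:
        raise RuntimeError('magic-pdf not found; try: pip install -U "magic-pdf[full]"')
    subprocess.run(build_command(pdf_path, out_dir), check=True)
```

Calling `convert("paper.pdf", "output")` would then invoke the CLI, provided the model weights have already been downloaded as noted under Setup.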

Highlighted Details

  • Supports GPU (CUDA), NPU (CANN), and MPS (Apple Silicon) acceleration.
  • Handles complex layouts, including headers, footers, footnotes, page numbers, and multi-column text.
  • Extracts images, captions, tables, and converts formulas to LaTeX.
  • Offers batch processing for improved speed, with formula parsing up to 1400% faster than v1.0.1.

Maintenance & Community

The project is actively developed with frequent updates, including model upgrades and bug fixes. Community engagement is encouraged via Discord and WeChat.

Licensing & Compatibility

The project depends on PyMuPDF, which is AGPL-licensed. Because the AGPL is a copyleft license, this may restrict use in closed-source or commercial products. The maintainers plan to explore more permissively licensed PDF processing libraries.

Limitations & Caveats

Vertical text is not supported. Reading order may be imperfect in extremely complex layouts. Rule-based recognition of tables of contents and lists might miss uncommon formats. Code block recognition is not yet implemented in the layout model. OCR accuracy may vary for less common languages or scripts.

Health Check

  • Last commit: 1 day ago
  • Responsiveness: 1 day
  • Pull requests (30d): 142
  • Issues (30d): 95
  • Star history: 8,722 stars in the last 90 days
