MinerU-Diffusion by opendatalab

Diffusion decoding for document OCR

Created 1 month ago
474 stars

Top 64.2% on SourcePulse

Project Summary

A diffusion-based framework for document Optical Character Recognition (OCR), MinerU-Diffusion reframes OCR as an inverse rendering problem. It replaces slow, autoregressive decoding with a novel block-level parallel diffusion decoding approach, offering significant speedups and improved robustness for document analysis tasks. The framework is designed for researchers and developers seeking efficient and accurate document parsing solutions.

How It Works

MinerU-Diffusion employs parallel diffusion decoding, reconstructing structured text from masked tokens under visual conditioning. This contrasts with traditional left-to-right autoregressive methods. The approach utilizes uncertainty-driven curriculum learning and a structured block-attention mask during training, enabling parallel refinement within blocks while maintaining coarse autoregressive structure across blocks. This design facilitates parallel generation with global consistency and faster decoding.
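The block-attention pattern described above can be sketched as follows. This is an illustrative reconstruction, not the project's actual implementation: tokens attend bidirectionally within their own block (enabling parallel refinement) and to all tokens in earlier blocks (preserving the coarse autoregressive order across blocks).

```python
import numpy as np

def block_attention_mask(seq_len: int, block_size: int) -> np.ndarray:
    """Sketch of a structured block-attention mask (assumed, not the
    official code). True means position i may attend to position j."""
    mask = np.zeros((seq_len, seq_len), dtype=bool)
    for i in range(seq_len):
        # End of the block containing token i; tokens see their whole
        # block (bidirectional) plus every earlier block (causal).
        block_end = (i // block_size + 1) * block_size
        mask[i, :min(block_end, seq_len)] = True
    return mask

m = block_attention_mask(6, 3)
# Tokens 0-2 attend only within block 0; tokens 3-5 attend to both blocks.
```

Within a block, masked tokens can thus be denoised in parallel, while each block still conditions only on completed earlier blocks, which is what yields parallel generation with global consistency.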

Quick Start & Requirements

  • Primary install/run command:
    conda create -n dmineru python=3.12 -y
    conda activate dmineru
    pip install --upgrade pip
    pip install torch==2.8.0 torchvision==0.23.0 torchaudio==2.8.0 --index-url https://download.pytorch.org/whl/cu128
    pip install "transformers>=4.52.1"
    # Download and install flash-attn wheel manually if needed
    wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu12torch2.8cxx11abiFALSE-cp312-cp312-linux_x86_64.whl
    pip install flash_attn-2.8.3+cu12torch2.8cxx11abiFALSE-cp312-cp312-linux_x86_64.whl
    pip install -r requirements.txt
    
  • Non-default prerequisites: Python 3.12, torch 2.8.0+cu128, flash-attn 2.8.3 (the wheel must match the CUDA, compiler ABI, and PyTorch versions in the stack), transformers >= 4.52.1. CUDA 12.8 is implied by the PyTorch wheel index.
  • Links: Gradio-based online demo, Official online web application (login required).
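Because the stack above is tightly pinned, a quick version check after installation can catch mismatches early. The helper below is a minimal sketch (the function name is ours, not part of the project) that compares dotted version strings, ignoring local build tags like `+cu128`:

```python
def check_min_version(installed: str, required: str) -> bool:
    """Return True if `installed` satisfies `required` (e.g. '4.52.1').
    Local build suffixes such as '+cu128' are stripped before comparing."""
    def to_tuple(v: str) -> tuple:
        return tuple(int(x) for x in v.split("+")[0].split(".")[:3])
    return to_tuple(installed) >= to_tuple(required)

# Example: verify the pinned torch and transformers versions.
assert check_min_version("2.8.0+cu128", "2.8.0")
assert check_min_version("4.52.1", "4.52.1")
```

In practice you would pass `torch.__version__` and `transformers.__version__` (or `importlib.metadata.version(...)`) as the `installed` argument.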

Highlighted Details

  • Achieves up to 3.26x faster decoding throughput (TPS) compared to MinerU2.5, with flexible accuracy-throughput trade-offs (e.g., 2.12x speedup at 99.9% relative accuracy).
  • Supports specialized prompt types for Layout Detection (bounding boxes, categories, rotation), Text Recognition (raw OCR), Formula Recognition (LaTeX), and Table Recognition (OTSL).
  • Integrates with acceleration engines like SGLang and Nano-vLLM for efficient inference.
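The four prompt types above suggest a simple task-to-prompt dispatch. The sketch below is hypothetical: the prompt strings and function name are placeholders, not MinerU-Diffusion's actual prompts or API.

```python
# Hypothetical task-to-prompt mapping; the project's real prompt
# strings and invocation interface may differ.
TASK_PROMPTS = {
    "layout": "Detect layout blocks with bounding boxes, categories, and rotation.",
    "ocr": "Recognize the raw text in the document image.",
    "formula": "Transcribe the formula as LaTeX.",
    "table": "Transcribe the table in OTSL format.",
}

def build_prompt(task: str) -> str:
    """Look up the prompt for a supported task, failing loudly otherwise."""
    if task not in TASK_PROMPTS:
        raise ValueError(f"unknown task: {task!r}")
    return TASK_PROMPTS[task]
```

Keeping the mapping explicit makes it easy to see which output format each task produces (raw text, LaTeX, or OTSL).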

Maintenance & Community

The project acknowledges foundational work from models like MinerU, Qwen2-VL, SDAR, and LLaMA, and acceleration methods such as SGLang and Nano-vLLM. Links to related projects (MinerU, LabelU, OmniDocBench) are provided. No specific community channels or active contributor information is detailed in the README.

Licensing & Compatibility

This project is licensed under the MIT License, permitting commercial use and closed-source linking.

Limitations & Caveats

The current release is V1, with V2 planned for future improvements in speed, elegance, and power. The training code is not yet released. Running the SGLang server requires separate setup following specific documentation.

Health Check

  • Last Commit: 1 week ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 6
  • Issues (30d): 6
  • Star History: 505 stars in the last 30 days

Explore Similar Projects

Starred by Shizhe Diao (author of LMFlow; Research Scientist at NVIDIA), Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), and 8 more.

EAGLE by SafeAILab

  • 0.8% · 2k stars
  • Speculative decoding research paper for faster LLM inference
  • Created 2 years ago · Updated 1 month ago