MinerU-Diffusion by opendatalab

Diffusion decoding for document OCR

Created 1 month ago
474 stars

Top 64.2% on SourcePulse

Project Summary

A diffusion-based framework for document Optical Character Recognition (OCR), MinerU-Diffusion reframes OCR as an inverse rendering problem. It replaces slow, autoregressive decoding with a novel block-level parallel diffusion decoding approach, offering significant speedups and improved robustness for document analysis tasks. The framework is designed for researchers and developers seeking efficient and accurate document parsing solutions.

How It Works

MinerU-Diffusion employs parallel diffusion decoding, reconstructing structured text from masked tokens under visual conditioning. This contrasts with traditional left-to-right autoregressive methods. The approach utilizes uncertainty-driven curriculum learning and a structured block-attention mask during training, enabling parallel refinement within blocks while maintaining coarse autoregressive structure across blocks. This design facilitates parallel generation with global consistency and faster decoding.
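The block-attention pattern described above can be sketched as follows. This is an illustrative reconstruction, not the project's actual implementation: tokens attend bidirectionally within their own block (enabling parallel refinement) and to all tokens in earlier blocks (preserving the coarse autoregressive order across blocks).

```python
import numpy as np

def block_attention_mask(seq_len: int, block_size: int) -> np.ndarray:
    """Sketch of a structured block-attention mask (assumed, not the
    official code). True means position i may attend to position j."""
    mask = np.zeros((seq_len, seq_len), dtype=bool)
    for i in range(seq_len):
        # End of the block containing token i; tokens see their whole
        # block (bidirectional) plus every earlier block (causal).
        block_end = (i // block_size + 1) * block_size
        mask[i, :min(block_end, seq_len)] = True
    return mask

m = block_attention_mask(6, 3)
# Tokens 0-2 attend only within block 0; tokens 3-5 attend to both blocks.
```

Within a block, masked tokens can thus be denoised in parallel, while each block still conditions only on completed earlier blocks, which is what yields parallel generation with global consistency.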

Quick Start & Requirements

  • Primary install/run command:
    conda create -n dmineru python=3.12 -y
    conda activate dmineru
    pip install --upgrade pip
    pip install torch==2.8.0 torchvision==0.23.0 torchaudio==2.8.0 --index-url https://download.pytorch.org/whl/cu128
    pip install "transformers>=4.52.1"
    # Download and install flash-attn wheel manually if needed
    wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu12torch2.8cxx11abiFALSE-cp312-cp312-linux_x86_64.whl
    pip install flash_attn-2.8.3+cu12torch2.8cxx11abiFALSE-cp312-cp312-linux_x86_64.whl
    pip install -r requirements.txt
    
  • Non-default prerequisites: Python 3.12, torch 2.8.0+cu128, flash-attn 2.8.3 (the wheel must match the CUDA, compiler ABI, and PyTorch versions in the stack), transformers >= 4.52.1. CUDA 12.8 is implied by the PyTorch wheel index.
  • Links: Gradio-based online demo, Official online web application (login required).
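Because the stack above is tightly pinned, a quick version check after installation can catch mismatches early. The helper below is a minimal sketch (the function name is ours, not part of the project) that compares dotted version strings, ignoring local build tags like `+cu128`:

```python
def check_min_version(installed: str, required: str) -> bool:
    """Return True if `installed` satisfies `required` (e.g. '4.52.1').
    Local build suffixes such as '+cu128' are stripped before comparing."""
    def to_tuple(v: str) -> tuple:
        return tuple(int(x) for x in v.split("+")[0].split(".")[:3])
    return to_tuple(installed) >= to_tuple(required)

# Example: verify the pinned torch and transformers versions.
assert check_min_version("2.8.0+cu128", "2.8.0")
assert check_min_version("4.52.1", "4.52.1")
```

In practice you would pass `torch.__version__` and `transformers.__version__` (or `importlib.metadata.version(...)`) as the `installed` argument.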

Highlighted Details

  • Achieves up to 3.26x faster decoding throughput (TPS) compared to MinerU2.5, with flexible accuracy-throughput trade-offs (e.g., 2.12x speedup at 99.9% relative accuracy).
  • Supports specialized prompt types for Layout Detection (bounding boxes, categories, rotation), Text Recognition (raw OCR), Formula Recognition (LaTeX), and Table Recognition (OTSL).
  • Integrates with acceleration engines like SGLang and Nano-vLLM for efficient inference.
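The four prompt types above suggest a simple task-to-prompt dispatch. The sketch below is hypothetical: the prompt strings and function name are placeholders, not MinerU-Diffusion's actual prompts or API.

```python
# Hypothetical task-to-prompt mapping; the project's real prompt
# strings and invocation interface may differ.
TASK_PROMPTS = {
    "layout": "Detect layout blocks with bounding boxes, categories, and rotation.",
    "ocr": "Recognize the raw text in the document image.",
    "formula": "Transcribe the formula as LaTeX.",
    "table": "Transcribe the table in OTSL format.",
}

def build_prompt(task: str) -> str:
    """Look up the prompt for a supported task, failing loudly otherwise."""
    if task not in TASK_PROMPTS:
        raise ValueError(f"unknown task: {task!r}")
    return TASK_PROMPTS[task]
```

Keeping the mapping explicit makes it easy to see which output format each task produces (raw text, LaTeX, or OTSL).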

Maintenance & Community

The project acknowledges foundational work from models like MinerU, Qwen2-VL, SDAR, and LLaMA, and acceleration methods such as SGLang and Nano-vLLM. Links to related projects (MinerU, LabelU, OmniDocBench) are provided. No specific community channels or active contributor information is detailed in the README.

Licensing & Compatibility

This project is licensed under the MIT License, permitting commercial use and closed-source linking.

Limitations & Caveats

The current release is V1, with V2 planned for future improvements in speed, elegance, and power. The training code is not yet released. Running the SGLang server requires separate setup following specific documentation.

Health Check

  • Last Commit: 1 week ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 6
  • Issues (30d): 6
  • Star History: 505 stars in the last 30 days

Explore Similar Projects

Starred by Shizhe Diao (author of LMFlow; Research Scientist at NVIDIA), Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), and 8 more.

EAGLE by SafeAILab

  • 0.8% · 2k stars
  • Speculative decoding research paper for faster LLM inference
  • Created 2 years ago · Updated 1 month ago