opendatalab: Diffusion decoding for document OCR
MinerU-Diffusion is a diffusion-based framework for document Optical Character Recognition (OCR) that reframes OCR as an inverse rendering problem. It replaces slow autoregressive decoding with block-level parallel diffusion decoding, which is intended to speed up decoding and improve robustness on document analysis tasks. The framework targets researchers and developers who need efficient, accurate document parsing.
How It Works
MinerU-Diffusion employs parallel diffusion decoding, reconstructing structured text from masked tokens under visual conditioning. This contrasts with traditional left-to-right autoregressive methods. The approach utilizes uncertainty-driven curriculum learning and a structured block-attention mask during training, enabling parallel refinement within blocks while maintaining coarse autoregressive structure across blocks. This design facilitates parallel generation with global consistency and faster decoding.
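The block structure described above can be illustrated with a small attention-mask sketch. This is not the project's implementation: the function name `block_attention_mask`, the fixed block size, and the omission of visual tokens are assumptions made for illustration. It only shows one plausible reading of "parallel refinement within blocks, autoregressive across blocks": a position may attend to every position in its own block and in all earlier blocks, but not to later blocks.

```python
def block_attention_mask(seq_len: int, block_size: int) -> list[list[bool]]:
    """Return mask[i][j] == True when position i may attend to position j.

    Hypothetical layout: full (bidirectional) attention inside a block,
    causal attention from each block to the blocks before it.
    """
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            # Allowed when j's block is the same as, or earlier than, i's block.
            row.append(j // block_size <= i // block_size)
        mask.append(row)
    return mask


# Six text positions split into three blocks of two tokens each.
mask = block_attention_mask(seq_len=6, block_size=2)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

With `block_size=1` this reduces to an ordinary causal mask, and with `block_size=seq_len` it becomes fully bidirectional, so the block size controls how much of the sequence is refined in parallel.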
Quick Start & Requirements
conda create -n dmineru python=3.12 -y
conda activate dmineru
pip install --upgrade pip
pip install torch==2.8.0 torchvision==0.23.0 torchaudio==2.8.0 --index-url https://download.pytorch.org/whl/cu128
pip install "transformers>=4.52.1"
# Download and install flash-attn wheel manually if needed
wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu12torch2.8cxx11abiFALSE-cp312-cp312-linux_x86_64.whl
pip install flash_attn-2.8.3+cu12torch2.8cxx11abiFALSE-cp312-cp312-linux_x86_64.whl
pip install -r requirements.txt
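After running the commands above, a short script can confirm that the environment matches the pinned versions. The following is a minimal sketch using only the standard library; the helper names (`parse_version`, `installed_version`, `check_environment`) are hypothetical, and the version numbers are copied from the install commands rather than from the project's own tooling.

```python
from importlib import metadata
from itertools import takewhile

# Pins copied from the install commands; adjust them if the repository changes.
EXPECTED = {"torch": "2.8.0", "torchvision": "0.23.0", "torchaudio": "2.8.0"}
MIN_TRANSFORMERS = (4, 52, 1)


def parse_version(text):
    """Turn '2.8.0+cu128' into (2, 8, 0), ignoring local and pre-release suffixes."""
    parts = []
    for piece in text.split("+")[0].split(".")[:3]:
        digits = "".join(takewhile(str.isdigit, piece))
        parts.append(int(digits) if digits else 0)
    return tuple(parts)


def installed_version(name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def check_environment():
    """Collect human-readable mismatches between the environment and the pins."""
    problems = []
    for name, wanted in EXPECTED.items():
        found = installed_version(name)
        if found is None:
            problems.append(f"{name} is not installed")
        elif parse_version(found) != parse_version(wanted):
            problems.append(f"{name} {found} found, expected {wanted}")
    found = installed_version("transformers")
    if found is None:
        problems.append("transformers is not installed")
    elif parse_version(found) < MIN_TRANSFORMERS:
        problems.append(f"transformers {found} found, expected >=4.52.1")
    return problems


if __name__ == "__main__":
    for line in check_environment() or ["environment matches the pinned versions"]:
        print(line)
```

The check reads package metadata instead of importing the packages, so it runs quickly and does not require a GPU to be visible.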
Maintenance & Community
The project acknowledges foundational work from models like MinerU, Qwen2-VL, SDAR, and LLaMA, and acceleration methods such as SGLang and Nano-vLLM. Links to related projects (MinerU, LabelU, OmniDocBench) are provided. No specific community channels or active contributor information is detailed in the README.
Licensing & Compatibility
This project is licensed under the MIT License, permitting commercial use and closed-source linking.
Limitations & Caveats
The current release is V1; a V2 is planned that aims to be faster, more elegant, and more powerful. The training code has not yet been released. Running the SGLang server requires a separate setup that follows its own documentation.