Multilingual document layout parsing with a single vision-language model
Top 20.6% on SourcePulse
dots.ocr is a multilingual document parsing model that unifies layout detection and content recognition into a single vision-language model. It targets researchers and developers needing to extract structured information from diverse documents, offering SOTA performance with a compact 1.7B parameter LLM.
How It Works
This project uses a single vision-language model (VLM) architecture, eliminating the complex multi-model pipelines common in traditional document parsing. By adjusting only the input prompt, the VLM switches between layout detection and content recognition tasks. This unified approach simplifies the architecture while achieving detection results competitive with specialized models such as DocLayout-YOLO.
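To make the prompt-switching idea concrete, here is a minimal Python sketch assuming a Hugging Face transformers-style interface loaded with trust_remote_code; the model path and both prompt strings are illustrative placeholders, and the repository's own prompt templates and inference code should be used in practice.

```python
# Minimal sketch: one set of weights, two tasks, selected purely by prompt.
# Assumptions: a local model directory and hypothetical prompt strings; the
# real prompts are defined by the dots.ocr repository.
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

MODEL_PATH = "./weights/DotsOCR"  # assumed local path after downloading weights
processor = AutoProcessor.from_pretrained(MODEL_PATH, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH, trust_remote_code=True, device_map="auto"
)

# Hypothetical prompts: same model, different task per prompt.
LAYOUT_PROMPT = "Detect the layout elements in this document image."
PARSE_PROMPT = "Recognize and output the text content of this document image."

def run(image_path: str, prompt: str) -> str:
    image = Image.open(image_path)
    inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=1024)
    return processor.batch_decode(output_ids, skip_special_tokens=True)[0]

layout = run("page.png", LAYOUT_PROMPT)    # layout detection
content = run("page.png", PARSE_PROMPT)    # content recognition
```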
Quick Start & Requirements
Use conda to create an environment, then run pip install -e . after cloning the repository. A PyTorch installation with CUDA 12.8 is recommended, and a Docker image is available for easier setup. Download the model weights with python3 tools/download_model.py.
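After installation, a quick sanity check with standard PyTorch APIs confirms that the recommended CUDA build is actually visible before running inference:

```python
# Environment check: verify the installed PyTorch build can see a CUDA device.
import torch

print(f"PyTorch {torch.__version__}, CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"Device: {torch.cuda.get_device_name(0)}")
```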
Highlighted Details
Maintenance & Community
The project builds on and acknowledges Qwen2.5-VL, aimv2, and MonkeyOCR, as well as datasets such as OmniDocBench, DocLayNet, M6Doc, CDLA, and D4LA. The maintainers can be contacted by email for collaboration.
Licensing & Compatibility
The repository does not explicitly state a license in the provided README. Compatibility for commercial use or closed-source linking is not specified.
Limitations & Caveats
The model has limitations with high-complexity tables and formulas, and it does not parse the content of picture regions. Parsing may fail on inputs with excessively high character-to-pixel ratios or long runs of special characters. Throughput is not yet optimized for large PDF volumes. The model performs best on images with fewer than 11,289,600 total pixels.
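If input images exceed that pixel ceiling, one straightforward workaround is to downscale them proportionally before parsing. The sketch below uses Pillow; the resizing strategy is an assumption for illustration, not part of the project's tooling.

```python
# Keep inputs under the documented pixel ceiling by proportional downscaling.
from PIL import Image

MAX_PIXELS = 11_289_600  # limit stated in the project's caveats

def fit_under_limit(path: str) -> Image.Image:
    image = Image.open(path)
    pixels = image.width * image.height
    if pixels > MAX_PIXELS:
        scale = (MAX_PIXELS / pixels) ** 0.5
        new_size = (int(image.width * scale), int(image.height * scale))
        image = image.resize(new_size, Image.LANCZOS)
    return image
```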