Multilingual document layout parsing with a single vision-language model
Top 20.6% on SourcePulse
dots.ocr is a multilingual document parsing model that unifies layout detection and content recognition into a single vision-language model. It targets researchers and developers needing to extract structured information from diverse documents, offering SOTA performance with a compact 1.7B parameter LLM.
How It Works
This project uses a single vision-language model (VLM) architecture, eliminating the complex multi-model pipelines common in traditional document parsing. By adjusting only the input prompt, the VLM switches between layout detection and content recognition tasks. This unified approach simplifies the architecture while achieving detection results competitive with specialized models such as DocLayout-YOLO.
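To make the prompt-switching idea concrete, here is a minimal Python sketch assuming a Hugging Face transformers-style interface loaded with trust_remote_code; the model path and both prompt strings are illustrative placeholders, and the repository's own prompt templates and inference code should be used in practice.

```python
# Minimal sketch: one set of weights, two tasks, selected purely by prompt.
# Assumptions: a local model directory and hypothetical prompt strings; the
# real prompts are defined by the dots.ocr repository.
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

MODEL_PATH = "./weights/DotsOCR"  # assumed local path after downloading weights
processor = AutoProcessor.from_pretrained(MODEL_PATH, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH, trust_remote_code=True, device_map="auto"
)

# Hypothetical prompts: same model, different task per prompt.
LAYOUT_PROMPT = "Detect the layout elements in this document image."
PARSE_PROMPT = "Recognize and output the text content of this document image."

def run(image_path: str, prompt: str) -> str:
    image = Image.open(image_path)
    inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=1024)
    return processor.batch_decode(output_ids, skip_special_tokens=True)[0]

layout = run("page.png", LAYOUT_PROMPT)    # layout detection
content = run("page.png", PARSE_PROMPT)    # content recognition
```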
Quick Start & Requirements
Use conda to create an environment, then run pip install -e . after cloning the repository. A PyTorch installation with CUDA 12.8 is recommended, and a Docker image is available for easier setup. Download the model weights with python3 tools/download_model.py.
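After installation, a quick sanity check with standard PyTorch APIs confirms that the recommended CUDA build is actually visible before running inference:

```python
# Environment check: verify the installed PyTorch build can see a CUDA device.
import torch

print(f"PyTorch {torch.__version__}, CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"Device: {torch.cuda.get_device_name(0)}")
```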
Highlighted Details
Maintenance & Community
The project builds on and acknowledges Qwen2.5-VL, aimv2, and MonkeyOCR, as well as datasets such as OmniDocBench, DocLayNet, M6Doc, CDLA, and D4LA. The maintainers can be contacted by email for collaboration.
Licensing & Compatibility
The repository does not explicitly state a license in the provided README. Compatibility for commercial use or closed-source linking is not specified.
Limitations & Caveats
The model has limitations with high-complexity tables and formulas, and it does not parse the content of picture regions. Parsing may fail on inputs with excessively high character-to-pixel ratios or long runs of special characters. Throughput is not yet optimized for large PDF volumes. The model performs best on images with fewer than 11,289,600 total pixels.
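If input images exceed that pixel ceiling, one straightforward workaround is to downscale them proportionally before parsing. The sketch below uses Pillow; the resizing strategy is an assumption for illustration, not part of the project's tooling.

```python
# Keep inputs under the documented pixel ceiling by proportional downscaling.
from PIL import Image

MAX_PIXELS = 11_289_600  # limit stated in the project's caveats

def fit_under_limit(path: str) -> Image.Image:
    image = Image.open(path)
    pixels = image.width * image.height
    if pixels > MAX_PIXELS:
        scale = (MAX_PIXELS / pixels) ** 0.5
        new_size = (int(image.width * scale), int(image.height * scale))
        image = image.resize(new_size, Image.LANCZOS)
    return image
```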