dots.mocr by studio-dots-ai

Parse anything from documents with multimodal OCR

Created 4 months ago

304 stars

Top 87.8% on SourcePulse

Project Summary

Multimodal OCR: Parse Anything from Documents (dots.mocr) is a comprehensive document parsing system designed to recognize diverse human scripts and structured graphical content. It addresses the challenge of extracting information from complex documents by integrating grounding, recognition, semantic understanding, and dialogue capabilities. The project offers state-of-the-art performance and novel SVG conversion for visual elements, benefiting researchers and power users needing advanced document analysis.

How It Works

The core approach employs a multimodal vision-language model (VLM) for unified document understanding. It excels at converting structured graphics, such as charts, UI layouts, and scientific figures, directly into Scalable Vector Graphics (SVG) code. This direct SVG generation is a key differentiator, enabling precise representation of visual data, complemented by a specialized dots.mocr-svg variant for enhanced image-to-SVG parsing.

Quick Start & Requirements

Primary install / run command (pip, Docker, binary, etc.).
- Installation requires Python 3.12, PyTorch with CUDA 12.8 support, and flash-attn. Setup involves cloning the repo and using pip installs.
Non-default prerequisites and dependencies (GPU, CUDA >= 12, Python 3.12, large dataset, API keys, OS, hardware, etc.).
- Python 3.12, CUDA >= 12.8, PyTorch, flash-attn. GPU is recommended.
Estimated setup time or resource footprint.
- Not specified.
If they are present, include links to official quick-start, docs, demo, or other relevant pages.
- arXiv: https://arxiv.org/abs/2603.13032
- vLLM Inference: Recommended for deployment; official Docker images are available.

Highlighted Details

Achieves state-of-the-art (SOTA) performance in multilingual document parsing and structured graphics-to-SVG conversion.
Demonstrates comparable general vision task performance to Qwen3-VL-4B.
Key benchmarks include an Elo score of 1124.7 on OmniDocBench and 83.9 on olmOCR-bench.
dots.mocr achieves 0.031 TextEdit and 0.029 Read OrderEdit on OmniDocBench v1.5.
The specialized dots.mocr-svg variant achieves 0.901 ISVGEN for image-to-SVG parsing.

Maintenance & Community

Community channels mentioned include WeChat and X (Twitter), but specific links are not provided.
No information on core contributors, sponsorships, or roadmap is available in the README.

Licensing & Compatibility

The license type and any compatibility notes for commercial use or closed-source linking are not specified in the provided README.

Limitations & Caveats

Extraction of complex tables and mathematical formulas remains challenging due to the model's compact 3B-parameter architecture.
The robustness of structured graphics parsing into SVG code has not yet reached desired levels.
Occasional parsing failures may still occur, although the rate has been reduced.

Health Check

Last Commit

4 months ago

Responsiveness

Inactive

Pull Requests (30d)

Issues (30d)

Star History

28 stars in the last 30 days