dots.mocr  by rednote-hilab

Parse anything from documents with multimodal OCR

Created 2 months ago
263 stars

Top 96.9% on SourcePulse

Project Summary

Multimodal OCR: Parse Anything from Documents (dots.mocr) is a comprehensive document parsing system designed to recognize diverse human scripts and structured graphical content. It addresses the challenge of extracting information from complex documents by integrating grounding, recognition, semantic understanding, and dialogue capabilities. The project offers state-of-the-art performance and novel SVG conversion for visual elements, benefiting researchers and power users needing advanced document analysis.

How It Works

The core approach employs a multimodal vision-language model (VLM) for unified document understanding. It excels at converting structured graphics, such as charts, UI layouts, and scientific figures, directly into Scalable Vector Graphics (SVG) code. This direct SVG generation is a key differentiator, enabling precise representation of visual data, complemented by a specialized dots.mocr-svg variant for enhanced image-to-SVG parsing.
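Because the model emits SVG as plain text, a downstream consumer would probably want to confirm the output is well-formed before rendering it. The following is a minimal sketch of such a check using only the Python standard library. The function name `summarize_svg` and the sample markup are hypothetical illustrations, not part of dots.mocr.

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree adds to namespaced tags."""
    return tag.rsplit("}", 1)[-1]


def summarize_svg(svg_text: str) -> dict:
    """Parse SVG markup and count its elements by tag name.

    Raises ValueError if the text is not well-formed XML or if the
    root element is not <svg>.
    """
    try:
        root = ET.fromstring(svg_text)
    except ET.ParseError as exc:
        raise ValueError(f"not well-formed XML: {exc}") from exc

    if local_name(root.tag) != "svg":
        raise ValueError(f"root element is <{local_name(root.tag)}>, expected <svg>")

    counts = {}
    for element in root.iter():  # iter() includes the root element itself
        name = local_name(element.tag)
        counts[name] = counts.get(name, 0) + 1
    return counts


# Hypothetical model output for a two-bar chart (illustrative only).
sample = (
    '<svg xmlns="http://www.w3.org/2000/svg" width="100" height="60">'
    '<rect x="10" y="20" width="30" height="40"/>'
    '<rect x="50" y="10" width="30" height="50"/>'
    '<text x="10" y="15">Sales</text>'
    "</svg>"
)

if __name__ == "__main__":
    print(summarize_svg(sample))
```

A check like this only catches structural problems such as unclosed tags. It says nothing about whether the SVG faithfully reproduces the source image.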

Quick Start & Requirements

  • Install: clone the repository and install the dependencies with pip.
  • Prerequisites: Python 3.12, PyTorch built with CUDA support (CUDA >= 12.8), and flash-attn. A GPU is recommended.
  • Setup time and resource footprint: not specified.
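The prerequisites above can be checked before attempting a full setup. The sketch below is one way to do that, not the project's own tooling: the function `check_environment` is hypothetical, and it assumes the import names `torch` and `flash_attn` for PyTorch and flash-attn.

```python
import importlib.util
import sys

REQUIRED_PYTHON = (3, 12)                   # Python version named in the summary
REQUIRED_MODULES = ("torch", "flash_attn")  # assumed import names


def check_environment(version=None, modules=REQUIRED_MODULES):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []

    # Compare only major.minor, since the summary names Python 3.12.
    major_minor = tuple(version or sys.version_info)[:2]
    if major_minor != REQUIRED_PYTHON:
        problems.append(
            f"Python {REQUIRED_PYTHON[0]}.{REQUIRED_PYTHON[1]} expected, "
            f"found {major_minor[0]}.{major_minor[1]}"
        )

    # find_spec locates a top-level module without importing it.
    for name in modules:
        if importlib.util.find_spec(name) is None:
            problems.append(f"module '{name}' is not installed")

    return problems


if __name__ == "__main__":
    issues = check_environment()
    for issue in issues:
        print("problem:", issue)
    if not issues:
        print("basic prerequisites look satisfied")
```

This sketch does not verify the CUDA version. Once PyTorch is installed, `torch.cuda.is_available()` and `torch.version.cuda` report whether a CUDA device is usable and which CUDA version the build targets.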

Highlighted Details

  • Achieves state-of-the-art (SOTA) performance in multilingual document parsing and structured graphics-to-SVG conversion.
  • Demonstrates comparable general vision task performance to Qwen3-VL-4B.
  • Key benchmark results include an Elo score of 1124.7 on OmniDocBench and a score of 83.9 on olmOCR-bench.
  • On OmniDocBench v1.5, dots.mocr achieves a Text Edit score of 0.031 and a Read Order Edit score of 0.029.
  • The specialized dots.mocr-svg variant achieves 0.901 ISVGEN for image-to-SVG parsing.
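For readers unfamiliar with edit-based scores such as the Text Edit figure above, here is a minimal sketch of a normalized edit distance. It assumes the metric is a Levenshtein distance divided by the length of the longer string, so lower is better. The benchmark's exact normalization and text preprocessing may differ, and the function names here are illustrative.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or
    substitutions needed to turn `a` into `b`."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of `a`
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into "" takes i deletions
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,                      # delete char_a
                    current[j - 1] + 1,                   # insert char_b
                    previous[j - 1] + substitution_cost,  # substitute or match
                )
            )
        previous = current
    return previous[-1]


def normalized_edit_distance(predicted: str, reference: str) -> float:
    """Edit distance scaled to [0, 1]; 0.0 means the strings are identical."""
    longest = max(len(predicted), len(reference))
    if longest == 0:
        return 0.0  # two empty strings are identical
    return levenshtein(predicted, reference) / longest


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))  # 3
    print(normalized_edit_distance("kitten", "sitting"))
```

Under this assumed definition, a score near 0.03 would mean roughly three character edits per hundred characters of reference text.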

Maintenance & Community

  • Community channels mentioned include WeChat and X (Twitter), but specific links are not provided.
  • No information on core contributors, sponsorships, or roadmap is available in the README.

Licensing & Compatibility

  • The license type and any compatibility notes for commercial use or closed-source linking are not specified in the provided README.

Limitations & Caveats

  • Extraction of complex tables and mathematical formulas remains challenging due to the model's compact 3B-parameter architecture.
  • The robustness of structured graphics parsing into SVG code has not yet reached desired levels.
  • Occasional parsing failures may still occur, although the rate has been reduced.

Health Check

  • Last commit: 2 months ago
  • Responsiveness: Inactive
  • Pull requests (30d): 1
  • Issues (30d): 2
  • Star history: 45 stars in the last 30 days
