MonkeyOCR by Yuliang-Liu

LMM for document parsing

created 2 months ago
5,295 stars

Top 9.7% on sourcepulse

Project Summary

MonkeyOCR is a lightweight, LMM-based model for parsing documents, designed to simplify complex multi-tool pipelines. It targets researchers and developers needing efficient and accurate document analysis for both English and Chinese content, offering improved performance on specialized elements like formulas and tables compared to existing methods.

How It Works

MonkeyOCR follows a Structure-Recognition-Relation (SRR) triplet paradigm that integrates structure detection, content recognition, and relationship prediction into a single, unified model. Compared with modular multi-tool pipelines, this is more efficient, and it avoids the computational overhead of running a large multimodal model over full pages. At 3B parameters, the model is significantly smaller than many state-of-the-art VLMs yet achieves competitive performance.
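
The three SRR stages can be pictured as a small data flow. The following is only an illustrative sketch, not MonkeyOCR's actual code: the Block schema, the function names, the hard-coded regions, and the choice of reading order as the predicted "relation" are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Block:
    """One detected page region (hypothetical schema, not MonkeyOCR's)."""
    kind: str                        # e.g. "title", "text", "formula", "table"
    bbox: Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels
    content: str = ""                # filled in by the recognition stage


def detect_structure(page) -> List[Block]:
    """Structure: stand-in for layout detection; returns fixed regions."""
    return [
        Block("text", (50, 400, 500, 600)),
        Block("title", (50, 40, 500, 90)),
        Block("formula", (50, 120, 500, 200)),
    ]


def recognize(page, block: Block) -> Block:
    """Recognition: stand-in for reading the content of a single block."""
    block.content = f"<{block.kind} content>"
    return block


def predict_relation(blocks: List[Block]) -> List[Block]:
    """Relation: assumed here to mean reading order (top-to-bottom, then left-to-right)."""
    return sorted(blocks, key=lambda b: (b.bbox[1], b.bbox[0]))


def parse_page(page) -> List[Block]:
    """Chain the three stages: structure, then recognition, then relation."""
    blocks = detect_structure(page)
    blocks = [recognize(page, b) for b in blocks]
    return predict_relation(blocks)
```

In this sketch, recognition runs once per detected block rather than once per full page, which is one plausible reading of how the design avoids full-page processing with a large model.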

Quick Start & Requirements

  • Download the model weights from Hugging Face with pip install huggingface_hub followed by python tools/download_model.py, or from ModelScope with pip install modelscope followed by python tools/download_model.py -t modelscope.
  • Requires Python and either Hugging Face Hub or ModelScope.
  • Run inference with python parse.py input_path.
  • A Gradio demo is available at http://vlrlabmonkey.xyz:7685.
  • Docker deployment is supported, requiring NVIDIA GPU support via nvidia-docker2.
  • See installation guide for details.
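
For batch use, the inference command above can be wrapped in a short script. This is a minimal sketch that assumes it runs from the repository root, where parse.py lives, and that parse.py takes the input path as its only required argument, as the command above shows. The helper names build_parse_command and parse_many are hypothetical and not part of MonkeyOCR.

```python
import subprocess
import sys
from typing import Iterable, List


def build_parse_command(input_path: str, script: str = "parse.py") -> List[str]:
    """Build the argv equivalent of `python parse.py input_path`."""
    return [sys.executable, script, input_path]


def parse_many(input_paths: Iterable[str], dry_run: bool = True) -> List[List[str]]:
    """Build one command per input; execute them only when dry_run is False."""
    commands = [build_parse_command(path) for path in input_paths]
    if not dry_run:
        for cmd in commands:
            # check=True raises CalledProcessError if parse.py exits non-zero.
            subprocess.run(cmd, check=True)
    return commands
```

Calling parse_many(["a.pdf", "b.pdf"]) only returns the commands it would run; pass dry_run=False to execute them one after another.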

Highlighted Details

  • Achieves 5.1% average improvement across nine document types compared to MinerU.
  • Outperforms larger models like Gemini 2.5 Pro and Qwen2.5 VL-72B on English documents with its 3B parameter model.
  • Processes documents at 0.84 pages/second, faster than MinerU (0.65) and Qwen2.5 VL-7B (0.12).
  • Supports AWQ quantization for reduced memory footprint.
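
To put the throughput figures above in perspective, the arithmetic can be worked through in a few lines. The speeds are the ones quoted in the list. The weight-memory estimate is an assumption-laden back-of-the-envelope figure: it assumes 16-bit weights before quantization and 4-bit weights after AWQ (a common AWQ setting, not confirmed for this project), and it ignores activations and quantization metadata.

```python
# Pages per second, as quoted in the list above.
PAGES_PER_SECOND = {
    "MonkeyOCR": 0.84,
    "MinerU": 0.65,
    "Qwen2.5 VL-7B": 0.12,
}


def speedup(fast: str, slow: str) -> float:
    """Ratio of throughputs: how many times faster `fast` is than `slow`."""
    return PAGES_PER_SECOND[fast] / PAGES_PER_SECOND[slow]


def minutes_for(pages: int, system: str) -> float:
    """Minutes needed to process `pages` pages at the quoted speed."""
    return pages / PAGES_PER_SECOND[system] / 60


def weight_gigabytes(params: float, bits_per_weight: int) -> float:
    """Approximate weight storage in GB (weights only; assumed bit widths)."""
    return params * bits_per_weight / 8 / 1e9
```

With these numbers, MonkeyOCR comes out at roughly 1.3 times the speed of MinerU and about 7 times the speed of Qwen2.5 VL-7B, and 1,000 pages take a little under 20 minutes. Under the stated assumptions, 3B weights occupy about 6 GB at 16 bits and about 1.5 GB at 4 bits.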

Maintenance & Community

  • The project released its English and Chinese parsing model on June 5, 2025.
  • The model is trending on Hugging Face.
  • The authors provide contact information for inquiries, including requests for larger models.

Licensing & Compatibility

  • Licensed under Apache 2.0.
  • The model is intended for non-commercial use.

Limitations & Caveats

MonkeyOCR does not currently support photographed documents. The hosted demo runs on a single GPU, so it may be unavailable during periods of high traffic, and the processing time shown on the demo page includes overhead beyond computation.

Health Check

  • Last commit: 6 days ago
  • Responsiveness: Inactive
  • Pull requests (30d): 19
  • Issues (30d): 69
  • Star history: 5,373 stars in the last 90 days