MonkeyOCR by Yuliang-Liu

LMM for document parsing

Created 7 months ago

6,427 stars

Top 7.9% on SourcePulse

Project Summary

MonkeyOCR is a lightweight, LMM-based model for parsing documents, designed to simplify complex multi-tool pipelines. It targets researchers and developers needing efficient and accurate document analysis for both English and Chinese content, offering improved performance on specialized elements like formulas and tables compared to existing methods.

How It Works

MonkeyOCR employs a Structure-Recognition-Relation (SRR) triplet paradigm, which integrates structure detection, content recognition, and relationship prediction into a single, unified model. Compared with modular multi-tool pipelines, this is more efficient, and it avoids the computational overhead of running a large multimodal model over full pages. At 3B parameters, the model achieves competitive performance with a significantly smaller parameter count than many state-of-the-art VLMs.
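
As a rough illustration of the three SRR stages, the sketch below models a page as a list of blocks: structure detection finds where each block is and what kind it is, recognition fills in its content, and relation prediction assigns a reading order. Every name here (Block, detect_structure, recognize_content, predict_relations, parse_page) is hypothetical, the stage functions are toy stand-ins rather than model calls, and the top-to-bottom sort is only a placeholder for whatever the real model predicts. It is not MonkeyOCR's actual code or API.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

BBox = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in page pixels


@dataclass
class Block:
    bbox: BBox          # structure: where the region is
    kind: str           # structure: "text", "table", "formula", ...
    content: str = ""   # recognition: what the region contains
    order: int = -1     # relation: position in the reading order


def detect_structure(page: Dict) -> List[Block]:
    """Hypothetical stand-in: return the layout regions of a page."""
    return [Block(bbox=bbox, kind=kind) for bbox, kind in page["regions"]]


def recognize_content(page: Dict, block: Block) -> str:
    """Hypothetical stand-in: look up the content of one region."""
    return page["text"][block.bbox]


def predict_relations(blocks: List[Block]) -> List[Block]:
    """Hypothetical stand-in: order blocks top-to-bottom, then left-to-right."""
    ordered = sorted(blocks, key=lambda b: (b.bbox[1], b.bbox[0]))
    for index, block in enumerate(ordered):
        block.order = index
    return ordered


def parse_page(page: Dict) -> List[Block]:
    """Run the three stages in sequence on one page."""
    blocks = detect_structure(page)
    for block in blocks:
        block.content = recognize_content(page, block)
    return predict_relations(blocks)


# Toy page: regions are listed out of reading order on purpose.
page = {
    "regions": [
        ((0, 200, 500, 400), "table"),
        ((0, 0, 500, 100), "text"),
        ((0, 120, 500, 180), "formula"),
    ],
    "text": {
        (0, 200, 500, 400): "| a | b |",
        (0, 0, 500, 100): "Introduction",
        (0, 120, 500, 180): "E = mc^2",
    },
}

result = parse_page(page)
for block in result:
    print(block.order, block.kind, block.content)
```

The point of the sketch is the shape of the output: each parsed element carries a (structure, recognition, relation) triplet, so downstream code can rebuild the document without a separate layout tool.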

Quick Start & Requirements

Download the model weights from Hugging Face (pip install huggingface_hub, then python tools/download_model.py) or from ModelScope (pip install modelscope, then python tools/download_model.py -t modelscope).
Requires Python and Hugging Face Hub/ModelScope.
Inference can be performed using python parse.py input_path.
A Gradio demo is available at http://vlrlabmonkey.xyz:7685.
Docker deployment is supported, requiring NVIDIA GPU support via nvidia-docker2.
See installation guide for details.
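
For batch use, the inference command above can be wrapped in a small script. The following is a minimal sketch, assuming it runs from a checkout of the repository so that parse.py resolves; the helper names build_parse_command and parse_document and the repo_dir parameter are hypothetical, and no command-line flags beyond the ones quoted above are assumed.

```python
import subprocess
import sys
from typing import List


def build_parse_command(input_path: str, script: str = "parse.py") -> List[str]:
    """Build the argument list for `python parse.py input_path`."""
    # sys.executable keeps the call inside the current Python environment.
    return [sys.executable, script, input_path]


def parse_document(input_path: str, repo_dir: str = ".") -> None:
    """Run the parser from the repository directory.

    repo_dir is an assumption: the folder that contains parse.py.
    check=True raises CalledProcessError if the parser exits non-zero.
    """
    subprocess.run(build_parse_command(input_path), cwd=repo_dir, check=True)
```

Calling parse_document("sample.pdf", repo_dir="MonkeyOCR") would then run the same command as typing it in a shell from that directory; both the file name and the directory name are placeholders.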

Highlighted Details

Achieves a 5.1% average improvement over MinerU across nine document types.
The 3B-parameter model outperforms larger models such as Gemini 2.5 Pro and Qwen2.5 VL-72B on English documents.
Processes documents at 0.84 pages/second, faster than MinerU (0.65 pages/second) and Qwen2.5 VL-7B (0.12 pages/second).
Supports AWQ quantization for reduced memory footprint.
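
To put the throughput figures in perspective, the short calculation below converts them into relative speedups and into the time needed for a batch of pages. Only the three pages/second values come from the list above; the 100-page batch size is an arbitrary illustration, and real timings will depend on hardware and document content.

```python
# Pages per second, as quoted in the list above.
THROUGHPUT = {
    "MonkeyOCR": 0.84,
    "MinerU": 0.65,
    "Qwen2.5 VL-7B": 0.12,
}


def speedup(fast_pps: float, slow_pps: float) -> float:
    """How many times faster the first system is than the second."""
    return fast_pps / slow_pps


def seconds_for(pages: int, pps: float) -> float:
    """Time in seconds to process `pages` at a constant pages/second rate."""
    return pages / pps


baseline = THROUGHPUT["MonkeyOCR"]
speedups = {
    name: speedup(baseline, pps)
    for name, pps in THROUGHPUT.items()
    if name != "MonkeyOCR"
}
batch_seconds = {name: seconds_for(100, pps) for name, pps in THROUGHPUT.items()}

for name, factor in speedups.items():
    print(f"MonkeyOCR vs {name}: {factor:.2f}x")
for name, secs in batch_seconds.items():
    print(f"{name}: {secs:.0f} s per 100 pages")
```

On these figures MonkeyOCR is roughly 1.3 times as fast as MinerU and about 7 times as fast as Qwen2.5 VL-7B, so a 100-page batch takes about two minutes instead of about fourteen with the latter.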

Maintenance & Community

The project released its English and Chinese parsing model on June 5, 2025.
The model is trending on Hugging Face.
Contact information for larger models or inquiries is provided.

Licensing & Compatibility

Licensed under Apache 2.0.
The model is intended for non-commercial use.

Limitations & Caveats

MonkeyOCR does not currently support photographed documents.
The demo runs on a single GPU, so it may be unavailable during periods of high traffic.
Processing times shown on the demo page include overhead beyond computation.
