LMM for document parsing
MonkeyOCR is a lightweight, LMM-based model for parsing documents, designed to simplify complex multi-tool pipelines. It targets researchers and developers needing efficient and accurate document analysis for both English and Chinese content, offering improved performance on specialized elements like formulas and tables compared to existing methods.
How It Works
MonkeyOCR employs a Structure-Recognition-Relation (SRR) triplet paradigm, which integrates structure detection, content recognition, and relationship prediction into a single, unified model. Compared with modular multi-tool pipelines, this is more efficient, and it avoids the computational overhead of running a large multimodal model over the full page. The model achieves competitive performance with a significantly smaller parameter count (3B parameters) than many state-of-the-art VLMs.
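One way to picture the SRR triplet is as three kinds of output for a page: detected regions (structure), the text recognized in each region (recognition), and an ordering over the regions (relation). The sketch below is illustrative only. `Block`, `assemble`, and the sample values are hypothetical names chosen for this example and are not part of MonkeyOCR's code or API.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Block:
    """One detected page region (hypothetical container, not MonkeyOCR's type)."""
    kind: str                         # structure: e.g. "text", "table", "formula"
    bbox: Tuple[int, int, int, int]   # structure: (x0, y0, x1, y1) in page pixels
    content: str = ""                 # recognition: text recovered for this region


def assemble(blocks: List[Block], reading_order: List[int]) -> str:
    """Join block contents following a predicted reading order (relation)."""
    # The order must mention every block index exactly once.
    if sorted(reading_order) != list(range(len(blocks))):
        raise ValueError("reading_order must be a permutation of block indices")
    return "\n\n".join(blocks[i].content for i in reading_order)


# Made-up sample data: regions are listed in detection order, not reading order.
blocks = [
    Block("text", (50, 300, 550, 400), "Body paragraph."),
    Block("text", (50, 50, 550, 100), "Title"),
    Block("formula", (50, 420, 550, 480), "$E = mc^2$"),
]
page = assemble(blocks, [1, 0, 2])
print(page)
```

Keeping the ordering separate from the blocks mirrors the idea that relationship prediction is its own step: the same detected regions can be re-assembled under a different order without re-running recognition.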
Quick Start & Requirements
Download model weights: pip install huggingface_hub, then python tools/download_model.py
Or via ModelScope: pip install modelscope, then python tools/download_model.py -t modelscope
Parse a document: python parse.py input_path
Demo: http://vlrlabmonkey.xyz:7685
Docker prerequisite: nvidia-docker2
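To run the parse command over several files, a small wrapper can build one `python parse.py input_path` invocation per document. This is a minimal sketch under assumptions: it is run from the repository root where `parse.py` lives, and the inputs are PDFs matched by `*.pdf`. The helper names `build_parse_command` and `parse_all`, the glob pattern, and the `docs` folder are hypothetical and not part of the project.

```python
import subprocess
import sys
from pathlib import Path
from typing import List


def build_parse_command(input_path: str) -> List[str]:
    """Return the argv for `python parse.py input_path` (the form shown above)."""
    return [sys.executable, "parse.py", input_path]


def parse_all(folder: str, pattern: str = "*.pdf", dry_run: bool = True) -> List[List[str]]:
    """Build (and optionally run) one parse command per matching file.

    The "*.pdf" pattern is an assumption; adjust it to the inputs you have.
    """
    commands = [build_parse_command(str(p)) for p in sorted(Path(folder).glob(pattern))]
    if not dry_run:
        for cmd in commands:
            # check=True raises CalledProcessError if parse.py exits non-zero.
            subprocess.run(cmd, check=True)
    return commands


# Dry run: print the commands without executing them.
for cmd in parse_all("docs"):
    print(" ".join(cmd))
```

Defaulting to `dry_run=True` lets you inspect the generated commands before anything touches the GPU. Pass `dry_run=False` once they look right.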
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
MonkeyOCR does not currently support photographed documents. The current deployment runs on a single GPU, which may cause availability issues during high traffic. Processing times reported on the demo page include overhead beyond model computation.