zai-org/GLM-OCR: Accurate and fast multimodal OCR for complex documents
Top 28.8% on SourcePulse
GLM-OCR is a multimodal OCR model designed for complex document understanding, offering a balance of accuracy, speed, and comprehensiveness. Built on the GLM-V architecture, it targets engineers and researchers needing to extract information from diverse and challenging document layouts, providing state-of-the-art performance and efficient inference capabilities.
How It Works
The model leverages a GLM-V encoder–decoder architecture, integrating the CogViT visual encoder pre-trained on large-scale image–text data. It employs a Multi-Token Prediction (MTP) loss and reinforcement learning to improve training efficiency and accuracy. A two-stage pipeline, which combines PP-DocLayout-V3 layout analysis with parallel recognition of the detected regions, ensures robust performance across varied document structures, including complex tables and code-heavy content.
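The two-stage flow can be illustrated with a minimal sketch; the detect_layout and recognize_region callables below are hypothetical stand-ins for the PP-DocLayout-V3 detector and the GLM-OCR recognizer, not the project's actual API.

```python
from concurrent.futures import ThreadPoolExecutor

def ocr_document(page_image, detect_layout, recognize_region):
    """Two-stage OCR sketch: detect layout regions, then recognize them in parallel.

    `detect_layout` and `recognize_region` are hypothetical callables standing in
    for the layout-analysis model and the recognition model.
    """
    regions = detect_layout(page_image)           # stage 1: layout analysis
    with ThreadPoolExecutor() as pool:            # stage 2: parallel recognition
        texts = list(pool.map(recognize_region, regions))
    # Reassemble results in the reading order produced by the layout stage.
    return "\n\n".join(texts)
```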
Quick Start & Requirements
Installation involves cloning the repository and installing the SDK (pip install -e .). Deployment options include a Zhipu MaaS API (no local GPU needed, recommended for quick start) or self-hosting the model locally using vLLM or SGLang for full control. Self-hosting requires installing these frameworks. Links to Hugging Face and ModelScope model downloads are provided.
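For self-hosting, vLLM exposes an OpenAI-compatible server, so a request could be issued with the openai Python client roughly as follows; the model identifier, endpoint, port, and prompt are assumptions for illustration, not details taken from the README.

```python
import base64
from openai import OpenAI

# Assumes a locally running vLLM OpenAI-compatible server, e.g. started with
# `vllm serve zai-org/GLM-OCR` (model id and port are assumptions).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Send a page image as a base64 data URL along with an OCR instruction.
with open("page.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="zai-org/GLM-OCR",  # hypothetical model id
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            {"type": "text", "text": "Extract all text from this document page."},
        ],
    }],
)
print(response.choices[0].message.content)
```

The same client code works against the Zhipu MaaS API by pointing base_url and api_key at that service instead of a local server.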
Highlighted Details
Maintenance & Community
The project encourages community engagement via WeChat and Discord. It acknowledges inspiration from PP-DocLayout-V3, PaddleOCR, and MinerU.
Licensing & Compatibility
The code is licensed under the Apache License 2.0, the GLM-OCR model under the MIT License, and PP-DocLayout-V3 under the Apache License 2.0. Users must comply with all applicable licenses.
Limitations & Caveats
The README does not document specific limitations, an alpha or beta status, or known bugs; the absence of such notes should not be read as confirmation of production readiness.