GLM-OCR  by zai-org

Accurate and fast multimodal OCR for complex documents

Created 3 months ago
6,702 stars

Top 7.5% on SourcePulse

GitHubView on GitHub
Project Summary

GLM-OCR is a multimodal OCR model designed for complex document understanding, offering a balance of accuracy, speed, and comprehensiveness. Built on the GLM-V architecture, it targets engineers and researchers needing to extract information from diverse and challenging document layouts, providing state-of-the-art performance and efficient inference capabilities.

How It Works

The model leverages a GLM-V encoder–decoder architecture, integrating the CogViT visual encoder pre-trained on large-scale image–text data. It employs Multi-Token Prediction (MTP) loss and reinforcement learning for improved training efficiency and accuracy. A two-stage pipeline, combining PP-DocLayout-V3 for layout analysis and parallel recognition, ensures robust performance across varied document structures, including complex tables and code-heavy content.

Quick Start & Requirements

Installation involves cloning the repository and installing the SDK (pip install -e .). Deployment options include a Zhipu MaaS API (no local GPU needed, recommended for quick start) or self-hosting the model locally using vLLM or SGLang for full control. Self-hosting requires installing these frameworks. Links to Hugging Face and ModelScope model downloads are provided.

Highlighted Details

  • Achieves #1 ranking on OmniDocBench V1.5 with a score of 94.62, demonstrating state-of-the-art performance across major document understanding benchmarks.
  • Optimized for real-world scenarios, handling complex tables, code-heavy documents, and seals robustly.
  • Features efficient inference with a 0.9B parameter model, supporting vLLM, SGLang, and Ollama for reduced latency and compute costs.
  • Offers a comprehensive SDK and toolchain for easy installation, one-line invocation, and production integration.

Maintenance & Community

The project encourages community engagement via WeChat and Discord. It acknowledges inspiration from PP-DocLayout-V3, PaddleOCR, and MinerU.

Licensing & Compatibility

The code is licensed under the Apache License 2.0, the GLM-OCR model under MIT License, and PP-DocLayoutV3 under Apache License 2.0. Users must comply with all applicable licenses.

Limitations & Caveats

The provided README does not detail specific limitations such as alpha status or known bugs, suggesting a production-ready state.

Health Check
Last Commit

1 month ago

Responsiveness

Inactive

Pull Requests (30d)
4
Issues (30d)
3
Star History
615 stars in the last 30 days

Explore Similar Projects

Starred by Jiayi Pan Jiayi Pan(Author of SWE-Gym; MTS at xAI), Shizhe Diao Shizhe Diao(Author of LMFlow; Research Scientist at NVIDIA), and
1 more.

METER by zdou0830

0%
376
Multimodal framework for vision-and-language transformer research
Created 4 years ago
Updated 3 years ago
Feedback? Help us improve.