GLM-OCR by zai-org

Accurate and fast multimodal OCR for complex documents

Created 1 week ago

New!

1,398 stars

Top 28.8% on SourcePulse

View on GitHub
1 Expert Loves This Project
Project Summary

GLM-OCR is a multimodal OCR model designed for complex document understanding, offering a balance of accuracy, speed, and comprehensiveness. Built on the GLM-V architecture, it targets engineers and researchers needing to extract information from diverse and challenging document layouts, providing state-of-the-art performance and efficient inference capabilities.

How It Works

The model leverages a GLM-V encoder–decoder architecture, integrating the CogViT visual encoder pre-trained on large-scale image–text data. It employs Multi-Token Prediction (MTP) loss and reinforcement learning for improved training efficiency and accuracy. A two-stage pipeline, combining PP-DocLayout-V3 for layout analysis and parallel recognition, ensures robust performance across varied document structures, including complex tables and code-heavy content.
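
The shape of that two-stage flow can be summarized in a short sketch. The function names below (detect_layout, recognize_region) are hypothetical stand-ins, not the GLM-OCR SDK API; the sketch only illustrates layout analysis followed by parallel recognition of the detected regions.

```python
# Illustrative sketch of the two-stage pipeline described above; all helpers
# are hypothetical stubs, not part of the GLM-OCR SDK.
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass
class Region:
    kind: str           # e.g. "text", "table", "code", "seal"
    bbox: tuple         # (x0, y0, x1, y1) in page coordinates
    content: str = ""


def detect_layout(page_image) -> list[Region]:
    """Stage 1: layout analysis (PP-DocLayout-V3 in the real pipeline)."""
    # Hypothetical stub: a real detector returns typed regions in reading order.
    return [Region("text", (0, 0, 100, 40)), Region("table", (0, 50, 100, 120))]


def recognize_region(page_image, region: Region) -> Region:
    """Stage 2: recognition of one cropped region by the OCR model."""
    region.content = f"<recognized {region.kind}>"  # placeholder output
    return region


def ocr_page(page_image) -> str:
    regions = detect_layout(page_image)
    # Regions are independent, so recognition can run in parallel.
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(lambda r: recognize_region(page_image, r), regions))
    # Reassemble results in the reading order produced by the layout stage.
    return "\n\n".join(r.content for r in results)


print(ocr_page(page_image=None))
```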

Quick Start & Requirements

Installation involves cloning the repository and installing the SDK (pip install -e .). Deployment options include a Zhipu MaaS API (no local GPU needed, recommended for quick start) or self-hosting the model locally using vLLM or SGLang for full control. Self-hosting requires installing these frameworks. Links to Hugging Face and ModelScope model downloads are provided.
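
A minimal self-hosting sketch is shown below, assuming a vLLM or SGLang server exposing the standard OpenAI-compatible endpoint on localhost. The repository URL, port, model ID, image path, and prompt are assumptions for illustration, not values taken from the README.

```python
# Install (from the README): clone the repo, then `pip install -e .`
#   git clone https://github.com/zai-org/GLM-OCR && cd GLM-OCR   # assumed repo URL
#   pip install -e .
# Then start an OpenAI-compatible server with vLLM or SGLang and query it:
import base64

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # assumed port

with open("invoice.png", "rb") as f:  # any document image
    image_b64 = base64.b64encode(f.read()).decode()

resp = client.chat.completions.create(
    model="zai-org/GLM-OCR",  # assumed model ID; use whatever ID you served
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            {"type": "text", "text": "Extract this document as Markdown."},
        ],
    }],
)
print(resp.choices[0].message.content)
```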

Highlighted Details

  • Achieves #1 ranking on OmniDocBench V1.5 with a score of 94.62, demonstrating state-of-the-art performance across major document understanding benchmarks.
  • Optimized for real-world scenarios, handling complex tables, code-heavy documents, and seals robustly.
  • Features efficient inference with a 0.9B parameter model, supporting vLLM, SGLang, and Ollama for reduced latency and compute costs (an Ollama sketch follows this list).
  • Offers a comprehensive SDK and toolchain for easy installation, one-line invocation, and production integration.
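
Since the list above mentions Ollama support, here is a hedged sketch of calling a locally running Ollama server through its REST generate endpoint. The model tag glm-ocr is an assumption; replace it with whatever tag is actually available locally.

```python
# Local inference via Ollama's REST API; the model tag is assumed.
import base64
import json
import urllib.request

with open("page.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

payload = {
    "model": "glm-ocr",              # assumed local tag
    "prompt": "Transcribe this page, preserving tables and code blocks.",
    "images": [image_b64],           # Ollama accepts base64-encoded images
    "stream": False,
}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
```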

Maintenance & Community

The project encourages community engagement via WeChat and Discord. It acknowledges inspiration from PP-DocLayout-V3, PaddleOCR, and MinerU.

Licensing & Compatibility

The code is licensed under the Apache License 2.0, the GLM-OCR model under the MIT License, and PP-DocLayout-V3 under the Apache License 2.0. Users must comply with all applicable licenses.

Limitations & Caveats

The README does not document specific limitations such as alpha status or known bugs. The absence of stated caveats should not be read as a guarantee of production readiness; evaluate the model on your own documents before deploying it.

Health Check

  • Last Commit: 1 day ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 33
  • Issues (30d): 61

Star History

1,405 stars in the last 10 days

Explore Similar Projects

Starred by Jiayi Pan (Author of SWE-Gym; MTS at xAI), Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), and 1 more.

METER by zdou0830 — Multimodal framework for vision-and-language transformer research (375 stars; created 4 years ago, last updated 3 years ago)