GLM-OCR by zai-org

Accurate and fast multimodal OCR for complex documents

Created 1 week ago

New!

1,398 stars

Top 28.8% on SourcePulse

View on GitHub

1 Expert Loves This Project

Luis Capelo

Cofounder of Lightning AI

Project Summary

GLM-OCR is a multimodal OCR model designed for complex document understanding, offering a balance of accuracy, speed, and comprehensiveness. Built on the GLM-V architecture, it targets engineers and researchers needing to extract information from diverse and challenging document layouts, providing state-of-the-art performance and efficient inference capabilities.

How It Works

The model leverages a GLM-V encoder–decoder architecture, integrating the CogViT visual encoder pre-trained on large-scale image–text data. It employs Multi-Token Prediction (MTP) loss and reinforcement learning for improved training efficiency and accuracy. A two-stage pipeline, combining PP-DocLayout-V3 for layout analysis and parallel recognition, ensures robust performance across varied document structures, including complex tables and code-heavy content.

Quick Start & Requirements

Installation involves cloning the repository and installing the SDK (pip install -e .). Deployment options include a Zhipu MaaS API (no local GPU needed, recommended for quick start) or self-hosting the model locally using vLLM or SGLang for full control. Self-hosting requires installing these frameworks. Links to Hugging Face and ModelScope model downloads are provided.

Highlighted Details

Achieves #1 ranking on OmniDocBench V1.5 with a score of 94.62, demonstrating state-of-the-art performance across major document understanding benchmarks.
Optimized for real-world scenarios, handling complex tables, code-heavy documents, and seals robustly.
Features efficient inference with a 0.9B parameter model, supporting vLLM, SGLang, and Ollama for reduced latency and compute costs.
Offers a comprehensive SDK and toolchain for easy installation, one-line invocation, and production integration.

Maintenance & Community

The project encourages community engagement via WeChat and Discord. It acknowledges inspiration from PP-DocLayout-V3, PaddleOCR, and MinerU.

Licensing & Compatibility

The code is licensed under the Apache License 2.0, the GLM-OCR model under MIT License, and PP-DocLayoutV3 under Apache License 2.0. Users must comply with all applicable licenses.

Limitations & Caveats

The provided README does not detail specific limitations such as alpha status or known bugs, suggesting a production-ready state.

Health Check

Last Commit

1 day ago

Responsiveness

Inactive

Pull Requests (30d)

Issues (30d)

Star History

1,405 stars in the last 10 days