GOT-OCR2.0: a unified, end-to-end OCR model (research paper and code)
Top 6.8% on sourcepulse
GOT-OCR2.0 offers a unified, end-to-end model for Optical Character Recognition (OCR) and document understanding, targeting researchers and developers seeking advanced OCR capabilities. It aims to advance the field towards "OCR 2.0" by providing a versatile and powerful solution.
How It Works
GOT-OCR2.0 is built upon the Vary architecture, pairing a vision encoder with a large language model (LLM) decoder. It treats OCR as a sequence-to-sequence problem, enabling it to handle complex document layouts, formatting, and even mathematical expressions. This unified approach allows a single model to perform a range of OCR-related tasks, from basic text extraction to fine-grained formatting and layout analysis.
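The sequence-to-sequence framing can be sketched with a toy greedy decoder, where a task prompt token selects the output style. This is purely illustrative: the prompt tokens (`<ocr>`, `<format>`), the stub model, and all names here are invented for the sketch, not GOT's actual vocabulary or API.

```python
# Toy sketch of OCR-as-seq2seq (illustrative only; the real model is a
# vision encoder feeding an autoregressive language-model decoder).
def greedy_decode(step_fn, prompt, max_len=16, eos="<eos>"):
    """Autoregressively extend `prompt` one token at a time until EOS."""
    seq = list(prompt)
    while len(seq) < max_len:
        tok = step_fn(seq)  # next-token prediction from the context so far
        if tok == eos:
            break
        seq.append(tok)
    return seq

# A stub "model": emits a canned transcription conditioned on the task token.
# (Hypothetical tokens -- GOT's real prompts and outputs differ.)
def stub_model(seq):
    plain = ["H", "i", "<eos>"]                       # plain-text OCR
    fmt = ["\\textbf{", "H", "i", "}", "<eos>"]       # formatted OCR
    out = fmt if seq[0] == "<format>" else plain
    emitted = len(seq) - 1  # tokens generated after the 1-token prompt
    return out[emitted]

print(greedy_decode(stub_model, ["<ocr>"]))     # plain-text task
print(greedy_decode(stub_model, ["<format>"]))  # formatted-output task
```

The point of the sketch: switching tasks changes only the prompt, not the model or the decoding loop, which is what makes the single-model approach "unified".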
Quick Start & Requirements
Create a conda environment, activate it, and install the package:

conda create -n got python=3.10
conda activate got
pip install -e .

Flash-Attention additionally requires:

pip install ninja
pip install flash-attn --no-build-isolation

Highlighted Details
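After installation, inference is run via the repo's demo script. The invocation below follows the upstream README's pattern; the paths are placeholders, and flags should be checked against the current repo before use.

```shell
# Plain-text OCR on a single image (paths are placeholders)
python3 GOT/demo/run_ocr_2.0.py --model-name /GOT_weights/ --image-file /path/to/image.png --type ocr

# Formatted output (e.g. markdown/LaTeX-style results)
python3 GOT/demo/run_ocr_2.0.py --model-name /GOT_weights/ --image-file /path/to/image.png --type format
```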
Maintenance & Community
The project has seen significant community contributions, including OpenVINO, GGUF/llama.cpp, and vLLM integrations. Multiple WeChat groups are available for communication, though they are frequently full. Contact email: weihaoran18@mails.ucas.ac.cn.
Licensing & Compatibility
The repository does not explicitly state a license. The project is built upon Vary and Qwen, whose licenses should be consulted for compatibility. Commercial use implications are not detailed.
Limitations & Caveats
The provided training code is for post-training (stage-2/stage-3) on their weights; stage-1 training requires a different repository. The multi-crop OCR demo does not support batch inference and operates at the token level.