GOT-OCR2.0 by Ucas-HaoranWei

Research paper and code for a unified end-to-end OCR model

created 11 months ago
7,747 stars

Top 6.8% on sourcepulse

Project Summary

GOT-OCR2.0 is a unified, end-to-end model for optical character recognition (OCR) and document understanding, aimed at researchers and developers who need more than plain text extraction. The project presents this single-model approach as a step toward "OCR 2.0".

How It Works

GOT-OCR2.0 is built on the Vary architecture and uses a large vision-language model (LVLM) as its foundation. It treats OCR as a sequence-to-sequence problem: the model reads an image and generates the output text token by token, which lets it handle complex document layouts, formatting, and mathematical expressions. Because the approach is unified, a single model covers tasks ranging from basic text extraction to fine-grained formatting and layout analysis.
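To make the sequence-to-sequence framing concrete, here is a minimal sketch of greedy autoregressive decoding. It is not GOT-OCR2.0's code: `greedy_decode`, `toy_next_token`, and the `<eos>` marker are hypothetical names, and the "decoder" simply replays a fixed script instead of running a trained network.

```python
from typing import Callable, List, Sequence

EOS = "<eos>"  # hypothetical end-of-sequence marker


def greedy_decode(
    image_features: Sequence[float],
    next_token: Callable[[Sequence[float], List[str]], str],
    max_tokens: int = 64,
) -> List[str]:
    """Emit tokens one at a time, conditioned on the image and the tokens so far."""
    output: List[str] = []
    for _ in range(max_tokens):
        token = next_token(image_features, output)
        if token == EOS:
            break
        output.append(token)
    return output


def toy_next_token(image_features: Sequence[float], prefix: List[str]) -> str:
    """Stand-in for a trained decoder: replays a fixed script and ignores the image."""
    script = ["\\frac", "{", "a", "}", "{", "b", "}", EOS]
    return script[len(prefix)] if len(prefix) < len(script) else EOS


if __name__ == "__main__":
    tokens = greedy_decode([0.1, 0.2], toy_next_token)
    print("".join(tokens))  # prints \frac{a}{b}
```

The point of the sketch is that plain text, formatting markup, and formulas are all just different output token sequences, so one decoding loop can serve every task.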

Quick Start & Requirements

  • Install: Clone the repository, create a conda environment (conda create -n got python=3.10), activate it (conda activate got), and install the dependencies (pip install -e .). Flash-Attention is installed separately: run pip install ninja, then pip install flash-attn --no-build-isolation.
  • Prerequisites: CUDA 11.8+, PyTorch 2.0.1+.
  • Resources: Requires significant GPU resources for training and potentially for inference, depending on model size.
  • Demo: Official demos are available on Hugging Face Spaces and ModelScope.
  • Docs: Paper available in the repository.
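
Before installing, it can help to check an environment against the prerequisites listed above. The following is a minimal sketch, not part of the project: `parse_version`, `meets_minimum`, and `check_environment` are hypothetical helpers, and the check assumes that the CUDA version PyTorch was built against (`torch.version.cuda`) is the relevant one.

```python
import re
from typing import Dict, Optional, Tuple

MIN_TORCH = "2.0.1"  # minimum PyTorch version listed in the prerequisites
MIN_CUDA = "11.8"    # minimum CUDA version listed in the prerequisites


def parse_version(text: str) -> Tuple[int, ...]:
    """Extract the leading dotted-integer part, e.g. '2.1.0+cu118' -> (2, 1, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", text.strip())
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def meets_minimum(found: str, minimum: str) -> bool:
    """Compare two version strings numerically, padding the shorter one with zeros."""
    a, b = parse_version(found), parse_version(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_environment() -> Dict[str, Optional[bool]]:
    """Report whether PyTorch and its CUDA build meet the minimums (None = not detected)."""
    report: Dict[str, Optional[bool]] = {"torch": None, "cuda": None}
    try:
        import torch
    except ImportError:
        return report
    report["torch"] = meets_minimum(torch.__version__, MIN_TORCH)
    # torch.version.cuda is the CUDA version PyTorch was built with; None on CPU-only builds.
    if torch.version.cuda is not None:
        report["cuda"] = meets_minimum(torch.version.cuda, MIN_CUDA)
    return report


if __name__ == "__main__":
    print(check_environment())
```

Versions are compared as integer tuples rather than as strings, since a string comparison would rank "11.10" below "11.8".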

Highlighted Details

  • Reached #1 on Hugging Face trending and passed 1 million model downloads.
  • Supports batch inference and several deployment formats (Hugging Face, PaddleMIX, ONNX, MNN, GGUF/llama.cpp).
  • Offers fine-tuning capabilities via ms-swift for custom datasets.
  • Includes benchmarks on Fox and OneChart datasets.

Maintenance & Community

The project has seen significant community contributions, including OpenVINO, GGUF/llama.cpp, and vLLM integrations. Multiple WeChat groups are available for communication, though they are frequently full. Contact email: weihaoran18@mails.ucas.ac.cn.

Licensing & Compatibility

The repository does not explicitly state a license. The project is built upon Vary and Qwen, whose licenses should be consulted for compatibility. Commercial use implications are not detailed.

Limitations & Caveats

The provided training code covers only post-training (stage-2/stage-3) on the released weights; stage-1 training requires a different repository. The multi-crop OCR demo does not support batch inference and operates at the token level.

Health Check

  • Last commit: 5 months ago
  • Responsiveness: 1 day
  • Pull requests (30d): 1
  • Issues (30d): 3
  • Star history: 285 stars in the last 90 days
