GOT-OCR2.0: a unified, end-to-end OCR model (research paper and code)
Top 6.8% on sourcepulse
GOT-OCR2.0 offers a unified, end-to-end model for Optical Character Recognition (OCR) and document understanding, targeting researchers and developers seeking advanced OCR capabilities. It aims to advance the field towards "OCR 2.0" by providing a versatile and powerful solution.
How It Works
GOT-OCR2.0 is built upon the Vary architecture, pairing a vision encoder with a large language model (LLM) decoder. It treats OCR as a sequence-to-sequence problem, enabling it to handle complex document layouts, formatting, and even mathematical expressions. This unified approach allows a single model to perform a range of OCR-related tasks, from basic text extraction to fine-grained formatting and layout analysis.
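The sequence-to-sequence framing can be sketched with a toy greedy decoder, where a task prompt token selects the output style. This is purely illustrative: the prompt tokens (`<ocr>`, `<format>`), the stub model, and all names here are invented for the sketch, not GOT's actual vocabulary or API.

```python
# Toy sketch of OCR-as-seq2seq (illustrative only; the real model is a
# vision encoder feeding an autoregressive language-model decoder).
def greedy_decode(step_fn, prompt, max_len=16, eos="<eos>"):
    """Autoregressively extend `prompt` one token at a time until EOS."""
    seq = list(prompt)
    while len(seq) < max_len:
        tok = step_fn(seq)  # next-token prediction from the context so far
        if tok == eos:
            break
        seq.append(tok)
    return seq

# A stub "model": emits a canned transcription conditioned on the task token.
# (Hypothetical tokens -- GOT's real prompts and outputs differ.)
def stub_model(seq):
    plain = ["H", "i", "<eos>"]                       # plain-text OCR
    fmt = ["\\textbf{", "H", "i", "}", "<eos>"]       # formatted OCR
    out = fmt if seq[0] == "<format>" else plain
    emitted = len(seq) - 1  # tokens generated after the 1-token prompt
    return out[emitted]

print(greedy_decode(stub_model, ["<ocr>"]))     # plain-text task
print(greedy_decode(stub_model, ["<format>"]))  # formatted-output task
```

The point of the sketch: switching tasks changes only the prompt, not the model or the decoding loop, which is what makes the single-model approach "unified".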
Quick Start & Requirements
Create a conda environment, activate it, and install the package:

conda create -n got python=3.10
conda activate got
pip install -e .

Flash-Attention additionally requires:

pip install ninja
pip install flash-attn --no-build-isolation

Highlighted Details
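After installation, inference is run via the repo's demo script. The invocation below follows the upstream README's pattern; the paths are placeholders, and flags should be checked against the current repo before use.

```shell
# Plain-text OCR on a single image (paths are placeholders)
python3 GOT/demo/run_ocr_2.0.py --model-name /GOT_weights/ --image-file /path/to/image.png --type ocr

# Formatted output (e.g. markdown/LaTeX-style results)
python3 GOT/demo/run_ocr_2.0.py --model-name /GOT_weights/ --image-file /path/to/image.png --type format
```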
Maintenance & Community
The project has seen significant community contributions, including OpenVINO, GGUF/llama.cpp, and vLLM integrations. Multiple WeChat groups are available for communication, though they are frequently full. Contact email: weihaoran18@mails.ucas.ac.cn.
Licensing & Compatibility
The repository does not explicitly state a license. The project is built upon Vary and Qwen, whose licenses should be consulted for compatibility. Commercial use implications are not detailed.
Limitations & Caveats
The provided training code is for post-training (stage-2/stage-3) on their weights; stage-1 training requires a different repository. The multi-crop OCR demo does not support batch inference and operates at the token level.