HunyuanOCR by Tencent-Hunyuan

Advanced OCR and document understanding via lightweight VLM

Created 1 week ago

771 stars

Top 45.3% on SourcePulse

Project Summary

HunyuanOCR is an end-to-end OCR expert VLM built on a lightweight 1B parameter multimodal architecture. It achieves state-of-the-art performance across complex multilingual document parsing, text spotting, information extraction, video subtitle extraction, and photo translation, offering significant deployment cost reductions and enhanced usability compared to cascaded solutions.

How It Works

Leveraging Hunyuan's native multimodal architecture and training strategy, HunyuanOCR achieves SOTA performance with a remarkably efficient 1B parameter design. This end-to-end approach integrates text detection, recognition, complex document parsing, information extraction, and translation into a single model, simplifying inference and reducing deployment costs. The design prioritizes usability: each task is driven by a single natural-language instruction and completed in a single inference pass, with no cascaded pipeline stages.
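To make the single-instruction workflow concrete, here is a minimal offline-inference sketch using vLLM's multimodal API. The model ID tencent/HunyuanOCR, the prompt wording, and the image handling are illustrative assumptions, not taken from the README; the exact prompt template (including image-placeholder tokens) is model-specific.

```python
# Minimal sketch: one instruction, one inference pass.
# Model ID and prompt are assumptions; consult the model card for the real template.
from vllm import LLM, SamplingParams
from PIL import Image

llm = LLM(model="tencent/HunyuanOCR", trust_remote_code=True)

image = Image.open("document.jpg")

# A single natural-language instruction covers the whole task; the real prompt
# must include the model's image-placeholder tokens per its chat template.
prompt = "Parse this document and return its full text content."

outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": image}},
    SamplingParams(temperature=0.0, max_tokens=2048),
)
print(outputs[0].outputs[0].text)
```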

Quick Start & Requirements

  • Installation: Recommended via vLLM (pip install vllm --pre --extra-index-url https://wheels.vllm.ai/nightly, or uv pip install vllm --extra-index-url https://wheels.vllm.ai/nightly). Installation via Transformers is also available (pip install git+https://github.com/huggingface/transformers@82a06db03535c49aa987719ed0746a76093b1ec4). A serving sketch based on the vLLM install follows this list.
  • Prerequisites: Linux OS, Python 3.12+, CUDA 12.9, PyTorch 2.7.1, NVIDIA GPU with CUDA support.
  • Hardware: Minimum 20 GB of GPU memory (for vLLM) and 6 GB of disk space.
  • Demo: A demo script run_hy_ocr.py is available in Hunyuan-OCR-master/Hunyuan-OCR-hf.
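For deployment, vLLM's standard OpenAI-compatible server can host the model; the sketch below assumes a server started with vllm serve tencent/HunyuanOCR on the default port. The model ID, port, file name, and prompt are illustrative assumptions, not values documented in the README.

```python
# Hypothetical client-side usage against a local vLLM OpenAI-compatible server,
# e.g. one started with: vllm serve tencent/HunyuanOCR
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Encode a local image as a data URL, the standard way to send images
# through the OpenAI chat-completions API.
with open("invoice.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="tencent/HunyuanOCR",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            {"type": "text", "text": "Parse this document into Markdown."},
        ],
    }],
    temperature=0.0,
)
print(response.choices[0].message.content)
```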

Highlighted Details

  • Achieves state-of-the-art (SOTA) performance across multiple OCR tasks with a highly efficient 1B parameter model.
  • Excels in complex multilingual document parsing, supporting over 100 languages and handling mixed-language scenarios.
  • Demonstrates superior performance in text spotting, information extraction (cards, receipts), and video subtitle extraction, outperforming larger general VLMs (see the extraction sketch after this list).
  • Offers robust capabilities for photo translation and document Question Answering.
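As an illustration of the information-extraction use case, a single prompt can request structured fields directly. This sketch reuses the hypothetical local vLLM server from the Quick Start section; the prompt wording, field names, and JSON schema are assumptions, not a documented model contract.

```python
# Hypothetical structured extraction from a receipt image (assumptions noted above).
import base64
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

with open("receipt.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

instruction = (
    "Extract the merchant name, date, and total amount from this receipt. "
    'Answer with JSON only, e.g. {"merchant": "...", "date": "...", "total": "..."}.'
)

response = client.chat.completions.create(
    model="tencent/HunyuanOCR",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            {"type": "text", "text": instruction},
        ],
    }],
    temperature=0.0,
)

# Parsing may fail if the model wraps its answer in extra text; handle accordingly.
fields = json.loads(response.choices[0].message.content)
print(fields)
```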

Maintenance & Community

The project acknowledges contributions and ideas from PaddleOCR, MinerU, MonkeyOCR, DeepSeek-OCR, dots.ocr, and benchmarks such as OmniDocBench, OCRBench, and DoTA. It also credits the vLLM and Hugging Face communities for inference support. No explicit community links or roadmap details are provided.

Licensing & Compatibility

No explicit license information is provided in the README.

Limitations & Caveats

Transformers-based inference currently shows degraded performance relative to the vLLM backend, though the authors state this is being addressed. The pinned CUDA 12.9 requirement is unusually specific and may be an adoption barrier.

Health Check

  • Last Commit: 3 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 3
  • Issues (30d): 43
  • Star History: 805 stars in the last 13 days
