HunyuanOCR by Tencent-Hunyuan

Advanced OCR and document understanding via lightweight VLM

Created 1 week ago

771 stars

Top 45.3% on SourcePulse

Project Summary

HunyuanOCR is an end-to-end OCR expert VLM built on a lightweight 1B parameter multimodal architecture. It achieves state-of-the-art performance across complex multilingual document parsing, text spotting, information extraction, video subtitle extraction, and photo translation, offering significant deployment cost reductions and enhanced usability compared to cascaded solutions.

How It Works

Leveraging Hunyuan's native multimodal architecture and training strategy, HunyuanOCR achieves SOTA performance with a remarkably efficient 1B parameter design. This end-to-end approach integrates text detection, recognition, complex document parsing, information extraction, and translation into a single model, simplifying inference and reducing deployment costs. The design prioritizes usability: each task is driven by a single natural-language instruction and completed in a single inference pass, with no cascaded pipeline stages.
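To make the single-instruction workflow concrete, here is a minimal offline-inference sketch using vLLM's multimodal API. The model ID tencent/HunyuanOCR, the prompt wording, and the image handling are illustrative assumptions, not taken from the README; the exact prompt template (including image-placeholder tokens) is model-specific.

```python
# Minimal sketch: one instruction, one inference pass.
# Model ID and prompt are assumptions; consult the model card for the real template.
from vllm import LLM, SamplingParams
from PIL import Image

llm = LLM(model="tencent/HunyuanOCR", trust_remote_code=True)

image = Image.open("document.jpg")

# A single natural-language instruction covers the whole task; the real prompt
# must include the model's image-placeholder tokens per its chat template.
prompt = "Parse this document and return its full text content."

outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": image}},
    SamplingParams(temperature=0.0, max_tokens=2048),
)
print(outputs[0].outputs[0].text)
```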

Quick Start & Requirements

  • Installation: Recommended via vLLM (pip install vllm --pre --extra-index-url https://wheels.vllm.ai/nightly, or uv pip install vllm --extra-index-url https://wheels.vllm.ai/nightly). Installation via Transformers is also available (pip install git+https://github.com/huggingface/transformers@82a06db03535c49aa987719ed0746a76093b1ec4). A serving sketch based on the vLLM install follows this list.
  • Prerequisites: Linux OS, Python 3.12+, CUDA 12.9, PyTorch 2.7.1, NVIDIA GPU with CUDA support.
  • Hardware: Minimum 20 GB of GPU memory (for vLLM) and 6 GB of disk space.
  • Demo: A demo script run_hy_ocr.py is available in Hunyuan-OCR-master/Hunyuan-OCR-hf.
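For deployment, vLLM's standard OpenAI-compatible server can host the model; the sketch below assumes a server started with vllm serve tencent/HunyuanOCR on the default port. The model ID, port, file name, and prompt are illustrative assumptions, not values documented in the README.

```python
# Hypothetical client-side usage against a local vLLM OpenAI-compatible server,
# e.g. one started with: vllm serve tencent/HunyuanOCR
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Encode a local image as a data URL, the standard way to send images
# through the OpenAI chat-completions API.
with open("invoice.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="tencent/HunyuanOCR",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            {"type": "text", "text": "Parse this document into Markdown."},
        ],
    }],
    temperature=0.0,
)
print(response.choices[0].message.content)
```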

Highlighted Details

  • Achieves state-of-the-art (SOTA) performance across multiple OCR tasks with a highly efficient 1B parameter model.
  • Excels in complex multilingual document parsing, supporting over 100 languages and handling mixed-language scenarios.
  • Demonstrates superior performance in text spotting, information extraction (cards, receipts), and video subtitle extraction, outperforming larger general VLMs (see the extraction sketch after this list).
  • Offers robust capabilities for photo translation and document Question Answering.
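As an illustration of the information-extraction use case, a single prompt can request structured fields directly. This sketch reuses the hypothetical local vLLM server from the Quick Start section; the prompt wording, field names, and JSON schema are assumptions, not a documented model contract.

```python
# Hypothetical structured extraction from a receipt image (assumptions noted above).
import base64
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

with open("receipt.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

instruction = (
    "Extract the merchant name, date, and total amount from this receipt. "
    'Answer with JSON only, e.g. {"merchant": "...", "date": "...", "total": "..."}.'
)

response = client.chat.completions.create(
    model="tencent/HunyuanOCR",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            {"type": "text", "text": instruction},
        ],
    }],
    temperature=0.0,
)

# Parsing may fail if the model wraps its answer in extra text; handle accordingly.
fields = json.loads(response.choices[0].message.content)
print(fields)
```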

Maintenance & Community

The project acknowledges contributions and ideas from PaddleOCR, MinerU, MonkeyOCR, DeepSeek-OCR, dots.ocr, and benchmarks such as OmniDocBench, OCRBench, and DoTA. It also credits the vLLM and Hugging Face communities for inference support. No explicit community links or roadmap details are provided.

Licensing & Compatibility

No explicit license information is provided in the README.

Limitations & Caveats

Transformers-based inference currently shows degraded performance relative to the vLLM backend, though the authors state this is being addressed. The pinned CUDA 12.9 requirement is unusually specific and may be an adoption barrier.

Health Check

  • Last Commit: 3 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 3
  • Issues (30d): 43
  • Star History: 805 stars in the last 13 days
