DeepSeek-OCR-2 by deepseek-ai

Advanced OCR model for visual document intelligence

Created 5 months ago

3,119 stars

Top 14.9% on SourcePulse

Summary

DeepSeek-OCR-2 is an Optical Character Recognition (OCR) model built around "Visual Causal Flow", an approach to visual encoding. It targets researchers and developers who need high-accuracy text extraction from images and PDFs, and it offers structured data conversion alongside what the project describes as human-like visual understanding.

How It Works

The system employs a novel "Visual Causal Flow" approach for visual encoding. It supports dynamic resolution processing, enabling flexible input image sizing and tokenization strategies to optimize accuracy and efficiency. Inference is optimized via vLLM for streaming image processing and concurrent PDF handling, or through the Hugging Face Transformers library with accelerated computation using flash attention.
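The Transformers path described above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the repository's own loader: the Hub id `deepseek-ai/DeepSeek-OCR-2` is assumed from the project name, `trust_remote_code=True` assumes the model ships custom code, and the helper names `load_kwargs` and `load_model` are hypothetical. The `flash_attention_2` and `torch.bfloat16` settings come from the "Highlighted Details" of this summary. The actual OCR call is defined by the repository's scripts and is deliberately not shown.

```python
# Assumption: the model is published on the Hugging Face Hub under this id.
MODEL_ID = "deepseek-ai/DeepSeek-OCR-2"


def load_kwargs():
    """Keyword arguments for AutoModel.from_pretrained (hypothetical helper)."""
    return {
        # Ask Transformers to use the FlashAttention-2 kernels.
        "attn_implementation": "flash_attention_2",
        # Assumption: the checkpoint relies on custom modeling code.
        "trust_remote_code": True,
    }


def load_model(model_id=MODEL_ID):
    """Load tokenizer and model in bfloat16 on a CUDA device.

    Imports are deferred so the module can be inspected on machines
    without torch, transformers, or a GPU.
    """
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    model = AutoModel.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,
        **load_kwargs(),
    )
    return tokenizer, model.eval().cuda()
```

Calling `load_model()` requires a CUDA GPU, an installed flash-attn build, and network access to fetch the weights.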

Quick Start & Requirements

Installation:
- Clone the repository.
- Create a conda environment (python=3.12.9).
- Install PyTorch 2.6.0 (cu118 build), the vllm-0.8.5+cu118 wheel, and flash-attn==2.7.3.
- Install the remaining dependencies from requirements.txt.
Prerequisites: CUDA 11.8+, PyTorch 2.6.0, Python 3.12.9.
Inference:
- vLLM: python run_dpsk_ocr2_image.py (images), python run_dpsk_ocr2_pdf.py (PDFs).
- Transformers: python run_dpsk_ocr2.py or via Python API.
Resources: The README mentions links to the model download, the paper, and the arXiv paper, but no URLs are given here.
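Because the setup pins exact versions, it can help to verify an environment before running the inference scripts. The following is a rough sketch of such a check, not part of the repository: the distribution names (`torch`, `vllm`, `flash-attn`) are assumptions about how the packages are registered, and `installed_version` and `find_mismatches` are hypothetical helpers. Local build suffixes such as `+cu118` are ignored when comparing.

```python
import sys
from importlib import metadata

# Pins taken from the requirements listed above.
# Assumption: these are the installed distribution names.
PINNED = {"torch": "2.6.0", "vllm": "0.8.5", "flash-attn": "2.7.3"}
PINNED_PYTHON = (3, 12, 9)


def installed_version(dist_name):
    """Return the installed version string, or None if not installed."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(lookup=installed_version, python_version=None):
    """Return human-readable messages for every pin that is not met."""
    problems = []
    current = tuple(python_version or sys.version_info[:3])
    if current != PINNED_PYTHON:
        problems.append(f"python: expected {PINNED_PYTHON}, found {current}")
    for name, wanted in PINNED.items():
        found = lookup(name)
        # Drop a local build tag such as "+cu118" before comparing.
        base = found.split("+", 1)[0] if found else None
        if base != wanted:
            problems.append(f"{name}: expected {wanted}, found {found}")
    return problems


if __name__ == "__main__":
    for message in find_mismatches():
        print(message)
```

An empty result means every pin matched. The check says nothing about the CUDA toolkit itself, which would need to be verified separately.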

Highlighted Details

Dynamic resolution support: (0-6)×768×768 + 1×1024×1024 inputs, which correspond to (0-6)×144 + 256 visual tokens.
Dual inference paths: vLLM for high-throughput streaming/concurrency and Hugging Face Transformers with flash attention.
Includes batch evaluation scripts for benchmarks like OmniDocBench v1.5.
Optimized inference with flash_attention_2 and torch.bfloat16.
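The visual-token formula in the dynamic resolution entry can be worked through with a few lines of arithmetic. This sketch assumes the natural reading of that formula: each 768×768 input contributes 144 tokens and the single 1024×1024 input contributes 256. The function name `visual_token_count` is hypothetical.

```python
# Assumption: per-input token counts as implied by "(0-6)×144 + 256".
TOKENS_PER_768_INPUT = 144
TOKENS_FOR_1024_INPUT = 256
MAX_768_INPUTS = 6


def visual_token_count(num_768_inputs: int) -> int:
    """Total visual tokens for a given number of 768x768 inputs."""
    if not 0 <= num_768_inputs <= MAX_768_INPUTS:
        raise ValueError(f"expected 0..{MAX_768_INPUTS}, got {num_768_inputs}")
    return num_768_inputs * TOKENS_PER_768_INPUT + TOKENS_FOR_1024_INPUT


if __name__ == "__main__":
    # Print the token budget for every allowed configuration.
    for n in range(MAX_768_INPUTS + 1):
        print(n, visual_token_count(n))
```

Under this reading, the visual-token budget ranges from 256 (no 768×768 inputs) to 1120 (six of them).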

Maintenance & Community

No specific details on maintainers, community channels (Discord/Slack), sponsorships, or roadmap were provided in the README.

Licensing & Compatibility

The license type is not specified in the provided README content, precluding assessment of commercial use or closed-source linking compatibility.

Limitations & Caveats

The core "Visual Causal Flow" concept lacks detailed explanation in the provided text. Citation information is pending ("coming soon~"). License details are absent, hindering compatibility assessment. Potential installation conflicts between vLLM and Transformers require careful environment management.
