DeepSeek-OCR-2 by deepseek-ai

Advanced OCR model for visual document intelligence

Created 1 month ago
2,346 stars

Top 19.0% on SourcePulse

Summary

DeepSeek-OCR 2 is an advanced Optical Character Recognition (OCR) system focused on "Visual Causal Flow" for enhanced visual encoding. It targets researchers and developers requiring high-accuracy text extraction from images and PDFs, offering capabilities for structured data conversion and human-like visual understanding.

How It Works

The system employs a novel "Visual Causal Flow" approach for visual encoding. It supports dynamic resolution processing, enabling flexible input image sizing and tokenization strategies to optimize accuracy and efficiency. Inference is optimized via vLLM for streaming image processing and concurrent PDF handling, or through the Hugging Face Transformers library with accelerated computation using flash attention.

Quick Start & Requirements

  • Installation: Clone the repository, create a conda environment with Python 3.12.9, then install PyTorch 2.6.0 (cu118 build), the vllm-0.8.5+cu118 wheel, flash-attn==2.7.3, and the remaining dependencies from requirements.txt.
  • Prerequisites: CUDA 11.8+, PyTorch 2.6.0, Python 3.12.9.
  • Inference:
    • vLLM: python run_dpsk_ocr2_image.py (images), python run_dpsk_ocr2_pdf.py (PDFs).
    • Transformers: python run_dpsk_ocr2.py or via Python API.
  • Resources: The README mentions links for the model download, the paper, and the arXiv paper, but they are not provided as URLs.
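
The Transformers path can be sketched roughly as follows. This is a minimal sketch, not the repository's own script: the Hub identifier `deepseek-ai/DeepSeek-OCR-2` is assumed from the project name, the helper names `build_load_config` and `load_model` are hypothetical, and the keyword arguments are the standard `from_pretrained` options that correspond to the flash_attention_2 and torch.bfloat16 settings the README mentions. The inference call itself is defined by the model's own code and is deliberately left out.

```python
"""Sketch: load an OCR checkpoint through Hugging Face Transformers.

Assumptions (not confirmed by the summary): the Hub id below and both
helper names are illustrative only.
"""

ASSUMED_MODEL_ID = "deepseek-ai/DeepSeek-OCR-2"  # assumption: inferred from the project name


def build_load_config(use_flash_attention: bool = True) -> dict:
    """Return the from_pretrained keyword arguments as plain data."""
    return {
        # Assumption: the checkpoint ships custom model code that must be trusted.
        "trust_remote_code": True,
        # "flash_attention_2" needs the flash-attn package; "eager" is the fallback.
        "attn_implementation": "flash_attention_2" if use_flash_attention else "eager",
        # Kept as a string so this function runs without torch installed.
        "torch_dtype": "bfloat16",
    }


def load_model(model_id: str = ASSUMED_MODEL_ID, use_flash_attention: bool = True):
    """Load tokenizer and model onto a CUDA device (needs torch, transformers, a GPU)."""
    import torch
    from transformers import AutoModel, AutoTokenizer

    config = build_load_config(use_flash_attention)
    # Convert the dtype name into the real torch dtype, e.g. torch.bfloat16.
    config["torch_dtype"] = getattr(torch, config["torch_dtype"])

    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    model = AutoModel.from_pretrained(model_id, **config)
    return tokenizer, model.eval().cuda()
```

Keeping the configuration in a separate function makes it easy to fall back to `use_flash_attention=False` on machines where flash-attn is not installed, at the cost of slower attention.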

Highlighted Details

  • Dynamic resolution support: (0-6)×768×768 + 1×1024×1024 inputs, corresponding to (0-6)×144 + 256 visual tokens (256 to 1,120 tokens per image).
  • Dual inference paths: vLLM for high-throughput streaming/concurrency and Hugging Face Transformers with flash attention.
  • Includes batch evaluation scripts for benchmarks like OmniDocBench v1.5.
  • Optimized inference with flash_attention_2 and torch.bfloat16.
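
The dynamic-resolution figures above imply a simple token budget per image. The sketch below works through that arithmetic, assuming the notation means 144 visual tokens per 768×768 input and 256 tokens for the single 1024×1024 input. The word "tile" and the function name `visual_token_count` are illustrative, not taken from the project.

```python
# Worked arithmetic for the visual-token formula (0-6)×144 + 256.
# Assumption: 144 tokens per 768×768 tile, 256 tokens for the 1024×1024 view.

TOKENS_PER_TILE = 144
GLOBAL_VIEW_TOKENS = 256
MAX_TILES = 6


def visual_token_count(num_tiles: int) -> int:
    """Return the visual tokens for an image split into `num_tiles` tiles."""
    if not 0 <= num_tiles <= MAX_TILES:
        raise ValueError(f"num_tiles must be between 0 and {MAX_TILES}, got {num_tiles}")
    return num_tiles * TOKENS_PER_TILE + GLOBAL_VIEW_TOKENS


if __name__ == "__main__":
    for n in range(MAX_TILES + 1):
        print(f"{n} tiles -> {visual_token_count(n)} visual tokens")
    # 0 tiles -> 256 visual tokens
    # 6 tiles -> 1120 visual tokens
```

Under this reading, the visual token count grows linearly with the number of tiles and never exceeds 1,120 per image.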

Maintenance & Community

No specific details on maintainers, community channels (Discord/Slack), sponsorships, or roadmap were provided in the README.

Licensing & Compatibility

The license type is not specified in the provided README content, precluding assessment of commercial use or closed-source linking compatibility.

Limitations & Caveats

The core "Visual Causal Flow" concept lacks detailed explanation in the provided text. Citation information is pending ("coming soon~"). License details are absent, hindering compatibility assessment. Potential installation conflicts between vLLM and Transformers require careful environment management.

Health Check

  • Last commit: 3 weeks ago
  • Responsiveness: Inactive
  • Pull requests (30d): 4
  • Issues (30d): 44
  • Star history: 2,376 stars in the last 30 days