DeepSeek-OCR by deepseek-ai

Context-aware OCR model for visual-text compression

Created 2 months ago

21,958 stars

Top 2.0% on SourcePulse

8 Experts Love This Project

Pawel Garbacki

Cofounder of Fireworks AI

Elvis Saravia

Founder of DAIR.AI

Magnus Müller

Cofounder of Browser Use

Shizhe Diao

Author of LMFlow; Research Scientist at NVIDIA

and 4 more!

Project Summary

DeepSeek-OCR investigates vision encoders from an LLM-centric viewpoint, focusing on visual-text compression. It targets researchers and developers working on OCR and multimodal AI, and treats text extraction as a question of how well an LLM can interpret visual input.

How It Works

The model integrates a vision encoder within an LLM framework to analyze visual content. This enables tasks such as document-to-markdown conversion, OCR, and detailed image description. It supports several input resolutions, including a dynamic mode, and uses flash attention to speed up inference.

Quick Start & Requirements

Installation: Clone the repository, create a Conda environment (Python 3.12.9), then install PyTorch 2.6.0 (CUDA 11.8), vLLM 0.8.5, and flash-attn 2.7.3.
Prerequisites: CUDA >= 11.8, PyTorch >= 2.6.0, Python >= 3.12.9.
Inference: Supports vLLM (image and PDF input; about 2,500 tokens/s on PDFs with an A100-40G) and Transformers (requires _attn_implementation='flash_attention_2' and torch.bfloat16).
Links: Repository: https://github.com/deepseek-ai/DeepSeek-OCR. Model download/paper links mentioned but not provided.
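
The quick-start steps above can be sketched roughly as follows: first compare the installed packages against the pinned versions, then load the model through Transformers with the two settings named on the Inference line. This is a minimal sketch, not the project's own script. The helper names, the distribution names used for lookup, the `model_id` placeholder, and `trust_remote_code=True` are all assumptions that do not come from the summary above.

```python
from importlib import metadata

# Versions copied from the Installation line above.
# The distribution names used as keys are assumptions.
PINNED = {"torch": "2.6.0", "vllm": "0.8.5", "flash-attn": "2.7.3"}


def installed_version(dist_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(pins=PINNED, lookup=installed_version):
    """Map each package that is missing or off-pin to the version found."""
    mismatches = {}
    for name, wanted in pins.items():
        found = lookup(name)
        # A local build tag such as "2.6.0+cu118" still counts as a match.
        if found is None or found.split("+")[0] != wanted:
            mismatches[name] = found
    return mismatches


def load_model(model_id):
    """Load tokenizer and model with the settings from the Inference line.

    model_id is a placeholder for the checkpoint name or a local path.
    trust_remote_code=True is an assumption, not stated in the summary.
    """
    # Imported lazily so the version check works without the GPU stack.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    model = AutoModel.from_pretrained(
        model_id,
        _attn_implementation="flash_attention_2",
        trust_remote_code=True,
    )
    # Move to the GPU and cast to bfloat16, as the Inference line requires.
    return tokenizer, model.eval().cuda().to(torch.bfloat16)
```

Calling `find_mismatches()` before `load_model(...)` gives a quick signal when the environment has drifted from the pinned versions. Whether newer versions also work is not something the summary states.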

Highlighted Details

Supports native resolutions (Tiny 512×512 to Large 1280×1280) and dynamic "Gundam" resolution (n×640×640 + 1×1024×1024).
Versatile prompting for document conversion, layout analysis, figure parsing, and detailed image descriptions.
The README claims about 2,500 tokens/s for vLLM PDF inference on an A100-40G.
Optimized with flash-attn and supports bfloat16 precision.
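
To make the resolution figures in this list concrete, the sketch below tallies the pixel budget of each mode. It is plain arithmetic on the sizes quoted above. The `View` class and helper names are illustrative and not part of the DeepSeek-OCR code, and only the two native sizes named in the list are included.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class View:
    """One square or rectangular image view fed to the vision encoder."""
    width: int
    height: int

    @property
    def pixels(self):
        return self.width * self.height


# Only the two native modes named in the list above; others are omitted.
NATIVE_MODES = {"tiny": View(512, 512), "large": View(1280, 1280)}


def gundam_views(n_tiles):
    """Return n_tiles local 640x640 views plus one global 1024x1024 view."""
    if n_tiles < 0:
        raise ValueError("n_tiles must be non-negative")
    return [View(640, 640)] * n_tiles + [View(1024, 1024)]


def total_pixels(views):
    """Sum the pixel counts over all views."""
    return sum(view.pixels for view in views)


if __name__ == "__main__":
    for name, view in NATIVE_MODES.items():
        print(name, view.pixels)
    print("gundam, n=4:", total_pixels(gundam_views(4)))
```

With four tiles, a Gundam input covers 4 × 640 × 640 + 1024 × 1024 = 2,686,976 pixels, compared with 1,638,400 for a single Large view.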

Maintenance & Community

The provided README lacks details on maintainers, community channels (e.g., Discord/Slack), sponsorships, or a roadmap. It acknowledges contributions from several other OCR and perception models/benchmarks.

Licensing & Compatibility

The license type and compatibility notes for commercial or closed-source use are not specified in the provided README content.

Limitations & Caveats

The release date is listed as [2025/10/20], potentially indicating a future or placeholder date. Specific performance benchmarks beyond the PDF throughput claim are not detailed. "Citation coming soon!" suggests a recent release with ongoing development.

Health Check

Last Commit

2 months ago

Responsiveness

Inactive

Star History

728 stars in the last 30 days