DeepSeek-OCR by deepseek-ai

Context-aware OCR model for visual-text compression

Created 2 weeks ago

19,324 stars

Top 2.3% on SourcePulse

Project Summary

Summary

DeepSeek-OCR investigates vision encoders from an LLM-centric viewpoint, with a focus on visual-text compression. It targets researchers and developers working on OCR and multimodal AI who need to extract text from visual data with an LLM.

How It Works

DeepSeek-OCR integrates vision encoders within an LLM framework for visual content analysis. This supports tasks such as document-to-markdown conversion, OCR, and detailed image description. The model accepts several fixed input resolutions as well as a dynamically scaled mode, and uses flash attention to speed up inference.

Quick Start & Requirements

  • Installation: Clone the repository, create a Conda environment (Python 3.12.9), then install PyTorch 2.6.0 (CUDA 11.8 build), vLLM 0.8.5, and flash-attn 2.7.3.
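  • Prerequisites: CUDA 11.8, PyTorch 2.6.0, and Python 3.12.9 are the versions named in the installation steps. Treat them as the tested baseline; newer versions may work but are not confirmed here.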
  • Inference: Supports vLLM (image and PDF input; about 2,500 tokens/s for PDFs on an A100-40G) and Transformers pipelines (which require _attn_implementation='flash_attention_2' and torch.bfloat16).
  • Links: Repository at https://github.com/deepseek-ai/DeepSeek-OCR. The README mentions model download and paper links, but they are not provided here.
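
The Transformers requirements in the Inference bullet can be sketched roughly as follows. This is a minimal sketch, not the project's official script. The helper names build_load_kwargs and load_model are hypothetical, and trust_remote_code=True is an assumption (that the model ships custom code on the hub). The model identifier is left to the caller because this summary does not give the download link.

```python
from typing import Any, Dict


def build_load_kwargs() -> Dict[str, Any]:
    """Keyword arguments for AutoModel.from_pretrained (hypothetical helper)."""
    return {
        # Stated in the summary as a requirement for the Transformers path.
        "_attn_implementation": "flash_attention_2",
        # Assumption: the model relies on custom code hosted with the weights.
        "trust_remote_code": True,
    }


def load_model(model_id: str):
    """Load tokenizer and model in bfloat16 on a CUDA device.

    model_id is supplied by the caller; this summary does not list it.
    Heavy imports are done lazily so the module can be imported on a
    machine without torch or transformers installed.
    """
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    model = AutoModel.from_pretrained(model_id, **build_load_kwargs())
    # bfloat16 is the precision named in the summary; .cuda() needs a GPU.
    model = model.eval().cuda().to(torch.bfloat16)
    return tokenizer, model
```

Calling load_model needs a CUDA GPU with flash-attn installed. How the loaded model is then prompted is defined by the repository's own code and is not covered here.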

Highlighted Details

  • Supports native resolutions (Tiny 512x512 to Large 1280x1280) and dynamic "Gundam" resolution (n×640×640 + 1×1024×1024).
  • Versatile prompting for document conversion, layout analysis, figure parsing, and detailed image descriptions.
  • The README reports about 2,500 tokens/s for vLLM PDF inference on an A100-40G.
  • Optimized with flash-attn and supports bfloat16 precision.
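
To make the resolution and throughput figures above concrete, here is a small arithmetic sketch. It only restates the numbers in this summary: the Tiny and Large sizes, the Gundam formula n×640×640 + 1×1024×1024, and the roughly 2,500 tokens/s figure. The names NativeMode, gundam_pixels, and estimated_seconds are hypothetical, and the sketch does not model how the project actually chooses n or counts tokens.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NativeMode:
    """A square native-resolution mode (hypothetical helper type)."""
    name: str
    side: int  # edge length in pixels

    @property
    def pixels(self) -> int:
        return self.side * self.side


# The two native sizes named in the summary.
TINY = NativeMode("Tiny", 512)
LARGE = NativeMode("Large", 1280)

GUNDAM_TILE_SIDE = 640     # each of the n local tiles
GUNDAM_GLOBAL_SIDE = 1024  # the single global view


def gundam_pixels(n_tiles: int) -> int:
    """Input pixels for the dynamic mode: n x 640 x 640 + 1 x 1024 x 1024."""
    if n_tiles < 0:
        raise ValueError("n_tiles must be non-negative")
    return n_tiles * GUNDAM_TILE_SIDE ** 2 + GUNDAM_GLOBAL_SIDE ** 2


def estimated_seconds(total_tokens: int, tokens_per_second: float = 2500.0) -> float:
    """Rough wall-clock estimate at the reported A100-40G PDF throughput."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return total_tokens / tokens_per_second


if __name__ == "__main__":
    print(TINY.pixels, LARGE.pixels)     # 262144 1638400
    print(gundam_pixels(4))              # 2686976
    print(estimated_seconds(1_000_000))  # 400.0
```

At the reported rate, a job of one million tokens would take about 400 seconds. Real throughput depends on hardware, batch size, and document content.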

Maintenance & Community

The provided README lacks details on maintainers, community channels (e.g., Discord/Slack), sponsorships, or a roadmap. It acknowledges contributions from several other OCR and perception models/benchmarks.

Licensing & Compatibility

The license type and compatibility notes for commercial or closed-source use are not specified in the provided README content.

Limitations & Caveats

The README lists a release date of [2025/10/20]. It gives no performance benchmarks beyond the PDF throughput claim. The note "Citation coming soon!" suggests a recent release with ongoing development.

Health Check

  • Last Commit: 1 week ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 25
  • Issues (30d): 191
  • Star History: 19,953 stars in the last 18 days
