deepseek_ocr_app by rdumasia303

DeepSeek OCR web app for advanced document analysis

Created 8 months ago

1,850 stars

Top 22.6% on SourcePulse

Project Summary

This project provides a modern, web-based application for Optical Character Recognition (OCR) leveraging the DeepSeek-OCR model. It targets developers and users needing robust OCR capabilities through a user-friendly interface, offering features like text extraction, image description, and term localization with visual bounding boxes. The application combines a React frontend with a FastAPI backend, delivering a responsive and feature-rich OCR solution.

How It Works

The application utilizes the DeepSeek-OCR model via a FastAPI backend, exposing OCR functionalities through a REST API. A React frontend handles user interactions, including drag-and-drop file uploads and result visualization. Core processing involves four distinct OCR modes: Plain OCR for raw text, Describe for image captioning, Find for locating specific terms with bounding boxes, and Freeform for custom prompt-based analysis. For large images, a dynamic cropping strategy is employed, splitting images into tiles for processing while maintaining global context. The backend correctly scales bounding box coordinates from the model's normalized 0-999 range to actual pixel dimensions.

Quick Start & Requirements

Primary install/run command: docker compose up --build after cloning the repository and configuring .env from .env.example.
Non-default prerequisites: NVIDIA GPU with CUDA support (minimum 8-12GB VRAM recommended, e.g., RTX 3090/4090/5090), Docker & Docker Compose, NVIDIA Container Toolkit. Specific NVIDIA driver installation steps are provided for Ubuntu, particularly for newer GPUs like the RTX 5090, requiring open-source drivers and kernel upgrades.
Estimated setup time/resource footprint: Initial model download is ~5-10GB. Requires ~20GB disk space. Setup time heavily depends on Docker image pulls and model download speed.
Relevant links: Frontend: http://localhost:3000, API Docs: http://localhost:8000/docs.

Highlighted Details

Features 4 core OCR modes: Plain OCR, Describe, Find, and Freeform.
Modern UI with glass morphism design, animated gradients, and Framer Motion animations.
Supports HTML and Markdown rendering for formatted output, including tables.
Handles multiple bounding box detection and visualization with proper coordinate scaling.
Configurable via .env file for ports, upload limits (default 100MB), model cache, and processing resolutions.

Maintenance & Community

The provided README does not detail specific contributors, community channels (like Discord/Slack), or a public roadmap. Maintenance appears focused on bug fixes and stability improvements, as indicated by recent updates.

Licensing & Compatibility

The project itself is licensed under the MIT License. However, it utilizes the DeepSeek-OCR model, and users must adhere to the model's specific license terms. The MIT license generally permits commercial use and integration into closed-source projects.

Limitations & Caveats

The project has simplified its feature set from 12 to 4 core working modes (Plain OCR, Describe, Find, Freeform), with advanced modes like table extraction and PII detection pending further testing. A significant adoption blocker is the strict requirement for an NVIDIA GPU with CUDA support and the potentially complex setup of NVIDIA drivers and the NVIDIA Container Toolkit, especially on newer hardware and operating systems.

deepseek_ocr_app by rdumasia303

Explore Similar Projects

flipbook-app by imcuttle

dots.mocr by rednote-hilab

Osprey by CircleRadon

macos-vision-ocr by bytefer

deepseek-ocr-client by ihatecsv

DeepSeek-OCR-WebUI by neosun100

DeepSeek-OCR-Web by fufankeji

Monkey by Yuliang-Liu

DeepSeek-OCR-2 by deepseek-ai

PixelRAG by StarTrail-org

liteparse by run-llama

DeepSeek-OCR by deepseek-ai