deepseek_ocr_app  by rdumasia303

DeepSeek OCR web app for advanced document analysis

Created 5 months ago
1,736 stars

Top 24.2% on SourcePulse

GitHubView on GitHub
Project Summary

This project provides a modern, web-based application for Optical Character Recognition (OCR) leveraging the DeepSeek-OCR model. It targets developers and users needing robust OCR capabilities through a user-friendly interface, offering features like text extraction, image description, and term localization with visual bounding boxes. The application combines a React frontend with a FastAPI backend, delivering a responsive and feature-rich OCR solution.

How It Works

The application utilizes the DeepSeek-OCR model via a FastAPI backend, exposing OCR functionalities through a REST API. A React frontend handles user interactions, including drag-and-drop file uploads and result visualization. Core processing involves four distinct OCR modes: Plain OCR for raw text, Describe for image captioning, Find for locating specific terms with bounding boxes, and Freeform for custom prompt-based analysis. For large images, a dynamic cropping strategy is employed, splitting images into tiles for processing while maintaining global context. The backend correctly scales bounding box coordinates from the model's normalized 0-999 range to actual pixel dimensions.

Quick Start & Requirements

  • Primary install/run command: docker compose up --build after cloning the repository and configuring .env from .env.example.
  • Non-default prerequisites: NVIDIA GPU with CUDA support (minimum 8-12GB VRAM recommended, e.g., RTX 3090/4090/5090), Docker & Docker Compose, NVIDIA Container Toolkit. Specific NVIDIA driver installation steps are provided for Ubuntu, particularly for newer GPUs like the RTX 5090, requiring open-source drivers and kernel upgrades.
  • Estimated setup time/resource footprint: Initial model download is ~5-10GB. Requires ~20GB disk space. Setup time heavily depends on Docker image pulls and model download speed.
  • Relevant links: Frontend: http://localhost:3000, API Docs: http://localhost:8000/docs.

Highlighted Details

  • Features 4 core OCR modes: Plain OCR, Describe, Find, and Freeform.
  • Modern UI with glass morphism design, animated gradients, and Framer Motion animations.
  • Supports HTML and Markdown rendering for formatted output, including tables.
  • Handles multiple bounding box detection and visualization with proper coordinate scaling.
  • Configurable via .env file for ports, upload limits (default 100MB), model cache, and processing resolutions.

Maintenance & Community

The provided README does not detail specific contributors, community channels (like Discord/Slack), or a public roadmap. Maintenance appears focused on bug fixes and stability improvements, as indicated by recent updates.

Licensing & Compatibility

The project itself is licensed under the MIT License. However, it utilizes the DeepSeek-OCR model, and users must adhere to the model's specific license terms. The MIT license generally permits commercial use and integration into closed-source projects.

Limitations & Caveats

The project has simplified its feature set from 12 to 4 core working modes (Plain OCR, Describe, Find, Freeform), with advanced modes like table extraction and PII detection pending further testing. A significant adoption blocker is the strict requirement for an NVIDIA GPU with CUDA support and the potentially complex setup of NVIDIA drivers and the NVIDIA Container Toolkit, especially on newer hardware and operating systems.

Health Check
Last Commit

1 week ago

Responsiveness

Inactive

Pull Requests (30d)
0
Issues (30d)
0
Star History
13 stars in the last 30 days

Explore Similar Projects

Feedback? Help us improve.