rdumasia303
DeepSeek OCR web app for advanced document analysis
Top 24.2% on SourcePulse
This project provides a modern, web-based application for Optical Character Recognition (OCR) leveraging the DeepSeek-OCR model. It targets developers and users needing robust OCR capabilities through a user-friendly interface, offering features like text extraction, image description, and term localization with visual bounding boxes. The application combines a React frontend with a FastAPI backend, delivering a responsive and feature-rich OCR solution.
How It Works
The application utilizes the DeepSeek-OCR model via a FastAPI backend, exposing OCR functionalities through a REST API. A React frontend handles user interactions, including drag-and-drop file uploads and result visualization. Core processing involves four distinct OCR modes: Plain OCR for raw text, Describe for image captioning, Find for locating specific terms with bounding boxes, and Freeform for custom prompt-based analysis. For large images, a dynamic cropping strategy is employed, splitting images into tiles for processing while maintaining global context. The backend correctly scales bounding box coordinates from the model's normalized 0-999 range to actual pixel dimensions.
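The coordinate scaling described above can be illustrated with a minimal sketch. This is not the project's actual code: the function name scale_box, the (x1, y1, x2, y2) box layout, and the choice of 999 as the divisor are assumptions made for illustration. Only the normalized 0-999 range comes from the description above.

```python
from typing import Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2); layout is an assumption


def scale_box(box: Box, image_width: int, image_height: int,
              norm_max: int = 999) -> Box:
    """Map a box from the normalized 0..norm_max grid to pixel coordinates.

    Hypothetical helper: norm_max=999 assumes the top of the normalized
    range maps exactly onto the image edge.
    """
    x1, y1, x2, y2 = box

    def to_pixels(value: int, size: int) -> int:
        # Multiply before dividing to limit floating-point error,
        # then clamp so the result stays inside the image.
        pixel = round(value * size / norm_max)
        return max(0, min(size, pixel))

    return (
        to_pixels(x1, image_width),
        to_pixels(y1, image_height),
        to_pixels(x2, image_width),
        to_pixels(y2, image_height),
    )


if __name__ == "__main__":
    # A box covering the full normalized range spans the whole image.
    print(scale_box((0, 0, 999, 999), 640, 480))
```

Skipping this step would draw boxes in the wrong place on any image whose dimensions differ from the normalized grid, which is why the scaling happens before results reach the frontend overlay.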
Quick Start & Requirements
Run docker compose up --build after cloning the repository and configuring .env from .env.example.
App: http://localhost:3000, API Docs: http://localhost:8000/docs.
Highlighted Details
Configuration through a .env file covers ports, upload limits (default 100MB), model cache, and processing resolutions.
Maintenance & Community
The provided README does not detail specific contributors, community channels (like Discord/Slack), or a public roadmap. Maintenance appears focused on bug fixes and stability improvements, as indicated by recent updates.
Licensing & Compatibility
The project itself is licensed under the MIT License. However, it utilizes the DeepSeek-OCR model, and users must adhere to the model's specific license terms. The MIT license generally permits commercial use and integration into closed-source projects.
Limitations & Caveats
The project has simplified its feature set from 12 to 4 core working modes (Plain OCR, Describe, Find, Freeform), with advanced modes like table extraction and PII detection pending further testing. A significant adoption blocker is the strict requirement for an NVIDIA GPU with CUDA support and the potentially complex setup of NVIDIA drivers and the NVIDIA Container Toolkit, especially on newer hardware and operating systems.
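Because the GPU toolchain is the main setup hurdle, a quick host check before building can save time. The sketch below is not part of the project; the helper name check_host_tools is hypothetical. It assumes that finding the docker, nvidia-smi, and nvidia-ctk executables on PATH is a reasonable first signal that Docker, the NVIDIA driver, and the NVIDIA Container Toolkit are installed. It does not verify that containers can actually reach the GPU.

```python
import shutil
from typing import Callable, Dict, Optional

# Executables assumed to indicate each prerequisite is installed.
REQUIRED_TOOLS = ("docker", "nvidia-smi", "nvidia-ctk")


def check_host_tools(
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Dict[str, bool]:
    """Report which prerequisite executables are found on PATH.

    The lookup function is injectable so the check can be tested
    without the tools being installed.
    """
    return {tool: which(tool) is not None for tool in REQUIRED_TOOLS}


if __name__ == "__main__":
    for tool, found in check_host_tools().items():
        print(f"{tool}: {'found' if found else 'MISSING'}")
```

A missing entry only narrows down where to look; a full check would still involve running a GPU-enabled container.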