text-extract-api by CatchTheTornado

Document extraction and parsing API using OCR and Ollama models

created 9 months ago
2,771 stars

Top 17.6% on sourcepulse

Project Summary

This project provides an API for extracting text and structured data from various document formats (PDF, Word, PPTX) and images, with capabilities for PII anonymization. It targets developers and users who need to process documents offline, leveraging modern OCR and LLM technologies to achieve high accuracy and transform output into JSON or Markdown.

How It Works

The API uses FastAPI as its web framework and Celery with Redis for asynchronous task processing and caching. It supports multiple OCR strategies, including EasyOCR, MiniCPM-V, and Llama Vision, with an option to integrate external OCR tools such as marker-pdf. LLMs (via Ollama) are employed to refine OCR output, correct errors, and extract structured data based on user prompts.
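Because processing runs through Celery, a client typically submits a document and then polls for the finished result. The sketch below illustrates that submit-and-poll pattern; the endpoint path and the response fields (`state`, `result`) are assumptions for illustration, not the project's documented API.

```python
# Hypothetical client-side sketch of the async OCR workflow (FastAPI + Celery).
# Endpoint paths and response shapes are illustrative assumptions.
import time

API_BASE = "http://localhost:8000"  # assumed default bind address

def upload_url() -> str:
    """Build the (hypothetical) document-upload endpoint URL."""
    return f"{API_BASE}/ocr/upload"

def poll_result(get_status, task_id: str, interval: float = 1.0,
                timeout: float = 120.0):
    """Poll a Celery-style status endpoint until the OCR task finishes.

    `get_status` is any callable returning a dict shaped like
    {"state": "PENDING" | "SUCCESS", "result": ...} -- an assumed shape.
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        status = get_status(task_id)
        if status.get("state") == "SUCCESS":
            return status.get("result")
        time.sleep(interval)  # back off before the next status check
    raise TimeoutError(f"task {task_id} not finished after {timeout}s")
```

In practice `get_status` would wrap an HTTP GET against the API's task-status endpoint; injecting it as a callable keeps the polling logic testable without a running server.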

Quick Start & Requirements

  • Install: Clone the repository and use make install, or set up manually with pip install -e .
  • Prerequisites: Ollama and Docker are required. For native GPU support on macOS, manual setup is recommended.
  • Dependencies: PyTorch, EasyOCR, Ollama, Redis, FastAPI.
  • Setup: make run or docker-compose up --build. GPU support requires docker-compose.gpu.yml.
  • Demo: demo.doctractor.com

Highlighted Details

  • Supports multiple OCR strategies: EasyOCR, MiniCPM-V, Llama Vision, and remote services.
  • Leverages Ollama for LLM integration, enabling PII removal and structured data extraction.
  • Offers flexible output formats (Markdown, JSON) and storage options (Local, Google Drive, S3).
  • Includes a CLI tool for direct interaction and task management.

Licensing & Compatibility

  • Licensed under the MIT License.
  • The marker-pdf strategy is GPL-3.0 licensed, and its model weights carry specific commercial-use restrictions.

Limitations & Caveats

Docker on macOS does not currently support Apple GPUs, requiring native setup for GPU acceleration. The DISABLE_LOCAL_OLLAMA environment variable is not yet functional within Docker environments.

Health Check

  • Last commit: 1 day ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 3
  • Issues (30d): 2
  • Star History: 222 stars in the last 90 days

