text-extract-api by CatchTheTornado

Document extraction and parsing API using OCR and Ollama models

created 9 months ago
2,771 stars

Top 17.6% on sourcepulse

Project Summary

This project provides an API for extracting text and structured data from various document formats (PDF, Word, PPTX) and images, with capabilities for PII anonymization. It targets developers and users who need to process documents offline, leveraging modern OCR and LLM technologies to achieve high accuracy and transform output into JSON or Markdown.

How It Works

The API uses FastAPI as its web framework and Celery with Redis for asynchronous task processing and caching. It supports multiple OCR strategies, including EasyOCR, MiniCPM-V, and Llama Vision, with an option to integrate external OCR tools such as marker-pdf. LLMs (via Ollama) are employed to refine OCR output, correct errors, and extract structured data based on user prompts.
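Because processing runs through Celery, a client typically submits a document and then polls for the finished result. The sketch below illustrates that submit-and-poll pattern; the endpoint path and the response fields (`state`, `result`) are assumptions for illustration, not the project's documented API.

```python
# Hypothetical client-side sketch of the async OCR workflow (FastAPI + Celery).
# Endpoint paths and response shapes are illustrative assumptions.
import time

API_BASE = "http://localhost:8000"  # assumed default bind address

def upload_url() -> str:
    """Build the (hypothetical) document-upload endpoint URL."""
    return f"{API_BASE}/ocr/upload"

def poll_result(get_status, task_id: str, interval: float = 1.0,
                timeout: float = 120.0):
    """Poll a Celery-style status endpoint until the OCR task finishes.

    `get_status` is any callable returning a dict shaped like
    {"state": "PENDING" | "SUCCESS", "result": ...} -- an assumed shape.
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        status = get_status(task_id)
        if status.get("state") == "SUCCESS":
            return status.get("result")
        time.sleep(interval)  # back off before the next status check
    raise TimeoutError(f"task {task_id} not finished after {timeout}s")
```

In practice `get_status` would wrap an HTTP GET against the API's task-status endpoint; injecting it as a callable keeps the polling logic testable without a running server.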

Quick Start & Requirements

  • Install: Clone the repository and use make install, or set up manually with pip install -e .
  • Prerequisites: Ollama and Docker are required. For native GPU support on macOS, manual setup is recommended.
  • Dependencies: PyTorch, EasyOCR, Ollama, Redis, FastAPI.
  • Setup: make run or docker-compose up --build. GPU support requires docker-compose.gpu.yml.
  • Demo: demo.doctractor.com

Highlighted Details

  • Supports multiple OCR strategies: EasyOCR, MiniCPM-V, Llama Vision, and remote services.
  • Leverages Ollama for LLM integration, enabling PII removal and structured data extraction.
  • Offers flexible output formats (Markdown, JSON) and storage options (Local, Google Drive, S3).
  • Includes a CLI tool for direct interaction and task management.

Licensing & Compatibility

  • Licensed under the MIT License.
  • The marker-pdf strategy is GPL-3.0 licensed, and its model weights carry specific commercial-use restrictions.

Limitations & Caveats

Docker on macOS does not currently support Apple GPUs, requiring native setup for GPU acceleration. The DISABLE_LOCAL_OLLAMA environment variable is not yet functional within Docker environments.

Health Check

  • Last commit: 1 day ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 3
  • Issues (30d): 2
  • Star History: 222 stars in the last 90 days

