OCR enhancement tool using LLMs for scanned PDFs
This project enhances Optical Character Recognition (OCR) output from scanned PDFs by using Large Language Models (LLMs) for error correction and formatting. It's designed for users who need to convert scanned documents into accurate, well-structured digital text, offering significant improvements over raw Tesseract output.
How It Works
The system first converts PDF pages into images, then applies Tesseract OCR for initial text extraction. The raw OCR text is segmented into overlapping chunks to preserve context. Each chunk is processed by an LLM to correct OCR errors and optionally format the text into Markdown. This approach leverages LLMs' natural language understanding to fix common OCR mistakes and improve overall readability.
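The overlapping-chunk step described above can be sketched roughly as follows. The chunk size, overlap, and function name here are illustrative, not the project's actual defaults:

```python
def chunk_text(text: str, chunk_size: int = 2000, overlap: int = 200) -> list[str]:
    """Split raw OCR text into overlapping chunks so that each LLM call
    retains some trailing context from the previous chunk."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap  # advance by chunk size minus overlap
    return chunks
```

Each chunk is then sent to the LLM with a correction prompt; the overlap gives the model enough surrounding text to resolve words split across chunk boundaries.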
Quick Start & Requirements
Install Python with pyenv, create a virtual environment, and install dependencies with pip install -r requirements.txt. Core dependencies include pdf2image and pytesseract (the Tesseract binary must be installed on the system). Optional: OpenAI or Anthropic API keys, or a compatible GGUF model for local LLM inference, configured via a .env file. Set input_pdf_file_path in the main script, and run python llm_aided_ocr.py.
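A .env file for the cloud-API path might look like the following; the variable names are illustrative assumptions, so check the repository for the exact keys it reads:

```
OPENAI_API_KEY=your-openai-key-here
ANTHROPIC_API_KEY=your-anthropic-key-here
```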
Highlighted Details
Supports both local LLM inference (via llama_cpp) and cloud-based APIs (OpenAI, Anthropic).
Maintenance & Community
The project is maintained by Dicklesworthstone. Contributions are welcomed via pull requests.
Licensing & Compatibility
Limitations & Caveats
The accuracy and speed of the correction process are highly dependent on the chosen LLM. Processing very large documents may be resource-intensive and time-consuming.