llm_aided_ocr  by Dicklesworthstone

OCR enhancement tool using LLMs for scanned PDFs

created 2 years ago
2,720 stars

Top 17.8% on sourcepulse

Project Summary

This project improves Optical Character Recognition (OCR) output from scanned PDFs by using Large Language Models (LLMs) to correct errors and apply formatting. It is aimed at users who need to convert scanned documents into accurate, well-structured digital text, and it is intended to produce cleaner results than raw Tesseract output.

How It Works

The system first converts PDF pages into images, then applies Tesseract OCR for initial text extraction. The raw OCR text is segmented into overlapping chunks to preserve context. Each chunk is processed by an LLM to correct OCR errors and optionally format the text into Markdown. This approach leverages LLMs' natural language understanding to fix common OCR mistakes and improve overall readability.
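
The overlapping-chunk step described above can be sketched as follows. This is a minimal illustration, not the project's own code: the function name, the character-based windows, and the default sizes are all assumptions chosen for brevity, and the real implementation may split on sentence or paragraph boundaries instead.

```python
def split_into_overlapping_chunks(
    text: str, chunk_size: int = 1000, overlap: int = 100
) -> list[str]:
    """Split text into windows of chunk_size characters.

    Consecutive windows share `overlap` characters, so a sentence cut at
    a boundary still appears with some surrounding context in the next
    chunk. (Hypothetical helper; sizes are illustrative defaults.)
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - overlap  # how far each window advances
    chunks: list[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window reaches the end of the text, so no trailing
        # chunk consists only of already-seen overlap.
        if start + chunk_size >= len(text):
            break
    return chunks


if __name__ == "__main__":
    for chunk in split_into_overlapping_chunks("abcdefghij", chunk_size=4, overlap=1):
        print(chunk)
```

Each chunk would then be sent to the LLM on its own. A larger overlap gives the model more context at chunk boundaries, at the cost of processing more text overall.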

Quick Start & Requirements

  • Install: Clone the repository, set up Python 3.12 via pyenv, create a virtual environment, and install dependencies with pip install -r requirements.txt.
  • Prerequisites: Python 3.12+, Tesseract OCR engine, pdf2image, pytesseract. Optional: OpenAI or Anthropic API keys, or a compatible GGUF model for local LLM inference.
  • Setup: Requires installing Tesseract OCR and configuring API keys or local LLM paths in a .env file.
  • Usage: Place the PDF in the project directory, update input_pdf_file_path in the main script, and run python llm_aided_ocr.py.
  • Docs: Project Repository
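
As a rough illustration of the setup step, the sketch below parses .env-style text and picks a backend from it. The key names USE_LOCAL_LLM, API_PROVIDER, OPENAI_API_KEY, and ANTHROPIC_API_KEY are assumptions made for this example and may not match the project's actual .env layout. Both helper functions are hypothetical too, so check the repository for the real settings.

```python
def parse_env(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blanks and '#' comments."""
    settings: dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        settings[key.strip()] = value.strip().strip('"').strip("'")
    return settings


def choose_backend(settings: dict[str, str]) -> str:
    """Return 'local', 'openai', or 'anthropic' (hypothetical key names)."""
    if settings.get("USE_LOCAL_LLM", "false").lower() == "true":
        return "local"
    provider = settings.get("API_PROVIDER", "").lower()
    if provider in {"openai", "anthropic"}:
        key_name = f"{provider.upper()}_API_KEY"
        if not settings.get(key_name):
            raise ValueError(f"{key_name} is missing from the configuration")
        return provider
    raise ValueError("No usable LLM backend configured")


if __name__ == "__main__":
    example = 'USE_LOCAL_LLM=False\nAPI_PROVIDER=OPENAI\nOPENAI_API_KEY="sk-test"\n'
    print(choose_backend(parse_env(example)))
```

Failing early on a missing key gives a clearer error than letting the first API call fail partway through a long document.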

Highlighted Details

  • Supports both local LLMs (via llama_cpp) and cloud-based APIs (OpenAI, Anthropic).
  • Features optional Markdown formatting, header/page number suppression, and quality assessment.
  • Employs asynchronous processing for API-based LLM calls to improve performance.
  • Includes adaptive token management to handle varying input lengths and model constraints.
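
The asynchronous processing mentioned above might look roughly like the sketch below, which runs chunk corrections concurrently with a cap on in-flight requests. It is an assumption-laden illustration: correct_chunk is a hypothetical stand-in that only normalizes whitespace rather than calling a real LLM API, and the function names and concurrency limit are not taken from the project.

```python
import asyncio
from typing import Awaitable, Callable


async def correct_chunk(chunk: str) -> str:
    """Hypothetical stand-in for an LLM API call.

    A real version would send the chunk to a provider and await the
    response; here we only yield control and normalize whitespace.
    """
    await asyncio.sleep(0)
    return " ".join(chunk.split())


async def correct_all(
    chunks: list[str],
    max_concurrency: int = 4,
    corrector: Callable[[str], Awaitable[str]] = correct_chunk,
) -> list[str]:
    """Correct all chunks concurrently, at most max_concurrency at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(chunk: str) -> str:
        async with semaphore:  # limits simultaneous requests
            return await corrector(chunk)

    # gather returns results in the same order as the inputs,
    # so the corrected chunks can be reassembled directly.
    return list(await asyncio.gather(*(bounded(c) for c in chunks)))


if __name__ == "__main__":
    print(asyncio.run(correct_all(["first   chunk", "  second chunk "])))
```

Capping concurrency with a semaphore is one common way to stay within a provider's rate limits while still overlapping network waits.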

Maintenance & Community

The project is maintained by Dicklesworthstone. Contributions are welcomed via pull requests.

Licensing & Compatibility

  • License: MIT License.
  • Compatibility: Permissive license allows for commercial use and integration with closed-source applications.

Limitations & Caveats

The accuracy and speed of the correction process are highly dependent on the chosen LLM. Processing very large documents may be resource-intensive and time-consuming.

Health Check

  • Last commit: 5 months ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 106 stars in the last 90 days
