OCR enhancement tool using LLMs for scanned PDFs
This project enhances Optical Character Recognition (OCR) output from scanned PDFs by using Large Language Models (LLMs) for error correction and formatting. It's designed for users who need to convert scanned documents into accurate, well-structured digital text, offering significant improvements over raw Tesseract output.
How It Works
The system first converts PDF pages into images, then applies Tesseract OCR for initial text extraction. The raw OCR text is segmented into overlapping chunks to preserve context. Each chunk is processed by an LLM to correct OCR errors and optionally format the text into Markdown. This approach leverages LLMs' natural language understanding to fix common OCR mistakes and improve overall readability.
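The overlapping-chunk step described above can be sketched roughly as follows. The chunk size, overlap, and function name here are illustrative, not the project's actual defaults:

```python
def chunk_text(text: str, chunk_size: int = 2000, overlap: int = 200) -> list[str]:
    """Split raw OCR text into overlapping chunks so that each LLM call
    retains some trailing context from the previous chunk."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap  # advance by chunk size minus overlap
    return chunks
```

Each chunk is then sent to the LLM with a correction prompt; the overlap gives the model enough surrounding text to resolve words split across chunk boundaries.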
Quick Start & Requirements
Install Python with pyenv, create a virtual environment, and install dependencies with pip install -r requirements.txt. Core dependencies include pdf2image and pytesseract (the Tesseract binary must be installed on the system). Optional: OpenAI or Anthropic API keys, or a compatible GGUF model for local LLM inference, configured via a .env file. Set input_pdf_file_path in the main script, and run python llm_aided_ocr.py.
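A .env file for the cloud-API path might look like the following; the variable names are illustrative assumptions, so check the repository for the exact keys it reads:

```
OPENAI_API_KEY=your-openai-key-here
ANTHROPIC_API_KEY=your-anthropic-key-here
```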
Highlighted Details
Supports both local LLM inference (via llama_cpp) and cloud-based APIs (OpenAI, Anthropic).
Maintenance & Community
The project is maintained by Dicklesworthstone. Contributions are welcomed via pull requests.
Licensing & Compatibility
Limitations & Caveats
The accuracy and speed of the correction process are highly dependent on the chosen LLM. Processing very large documents may be resource-intensive and time-consuming.