img2table by xavctn

Python library for document table extraction

Created 4 years ago

878 stars

Top 40.2% on SourcePulse

Project Summary

Summary

img2table is a Python library that identifies and extracts tables from images and PDF documents. It is a practical, CPU-friendly alternative to heavier neural-network-based solutions, aimed at users who need efficient table extraction across varied document formats.

How It Works

The library uses OpenCV image processing, rather than a purely deep-learning model, to detect table structure. It integrates with multiple OCR engines (e.g., Tesseract, PaddleOCR, EasyOCR) to extract cell content, so the OCR step of the pipeline can be swapped. This design favors speed and low resource usage, making it well suited to CPU-only environments.
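The library's own detection code is not reproduced here. As a rough illustration of the idea behind line-based detection, the sketch below assumes the horizontal and vertical ruling lines of a table have already been located (for example by an OpenCV line detector) and builds a grid of cell boxes from their positions. Cell and cells_from_lines are hypothetical names used only for this example, and the coordinates are made up.

```python
from typing import NamedTuple


class Cell(NamedTuple):
    """Bounding box of one table cell, in pixel coordinates."""
    x1: int
    y1: int
    x2: int
    y2: int


def cells_from_lines(h_lines: list[int], v_lines: list[int]) -> list[list[Cell]]:
    """Build a row-major grid of cells from ruling-line positions.

    h_lines holds the y-coordinates of horizontal lines and
    v_lines holds the x-coordinates of vertical lines.
    """
    ys = sorted(set(h_lines))
    xs = sorted(set(v_lines))
    # Each pair of neighbouring lines bounds one row (or one column).
    return [
        [Cell(x1, y1, x2, y2) for x1, x2 in zip(xs, xs[1:])]
        for y1, y2 in zip(ys, ys[1:])
    ]


# Illustrative input: 3 horizontal and 4 vertical lines give a 2 x 3 grid.
grid = cells_from_lines(h_lines=[10, 50, 90], v_lines=[0, 100, 200, 300])
print(len(grid), len(grid[0]))
print(grid[1][2])
```

Real documents need more care than this: lines can be broken or missing, and merged cells span several grid positions.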

Quick Start & Requirements

Install with pip: pip install img2table for the standard installation with Tesseract support, or pip install img2table[paddle], pip install img2table[easyocr], etc., for alternative OCR backends. The chosen OCR engine (e.g., Tesseract) must be installed and configured separately. PDF pages are converted to images at 200 DPI before processing. Skew and rotation detection is supported for angles up to 45 degrees, based on a method described by Huang (2020).
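A minimal usage sketch, assuming img2table and Tesseract are both installed and that the Image, TesseractOCR, and extract_tables names match the release in use. The wrapper function extract_tables_from_image is a hypothetical helper, not part of the library.

```python
def extract_tables_from_image(image_path: str):
    """Return the tables img2table finds in one image, using Tesseract for OCR."""
    # Imported inside the function so this file still loads
    # on machines where img2table is not installed.
    from img2table.document import Image
    from img2table.ocr import TesseractOCR

    ocr = TesseractOCR(lang="eng")   # requires a working Tesseract install
    doc = Image(src=image_path)      # image file to analyse
    return doc.extract_tables(ocr=ocr)
```

Calling the helper with the path to a scanned page should return a list of extracted table objects. In recent releases each one is expected to expose its content through attributes such as df and html, but check the documentation for the installed version.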

Highlighted Details

Supports heterogeneous documents, including native PDFs, scanned PDFs, and common image formats.
Capable of handling complex table structures, such as merged cells.
Offers extensive OCR integration options: Tesseract, PaddleOCR, EasyOCR, docTR, Surya, RapidOCR, Google Vision, AWS Textract, and Azure Cognitive Services.
Extracted tables are returned as objects with accessible Pandas DataFrame and HTML representations.
Provides direct export functionality for extracted tables to .xlsx files, preserving their original structure.
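Merged cells do not map directly onto a rectangular, DataFrame-style grid. One common convention, sketched below, is to repeat the merged value in every grid position it covers. This is an illustration of that convention only, not a description of how img2table builds its output; SpanCell and expand_spans are hypothetical names and the sample values are made up.

```python
from typing import NamedTuple


class SpanCell(NamedTuple):
    """A cell anchored at (row, col) that may span several rows or columns."""
    row: int
    col: int
    rowspan: int
    colspan: int
    text: str


def expand_spans(cells: list[SpanCell], n_rows: int, n_cols: int) -> list[list[str]]:
    """Fill an n_rows x n_cols grid, repeating each cell's text over its span."""
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for cell in cells:
        for r in range(cell.row, cell.row + cell.rowspan):
            for c in range(cell.col, cell.col + cell.colspan):
                grid[r][c] = cell.text
    return grid


cells = [
    SpanCell(0, 0, 1, 2, "Region"),  # header merged across two columns
    SpanCell(0, 2, 1, 1, "Sales"),
    SpanCell(1, 0, 2, 1, "EMEA"),    # label merged across two rows
    SpanCell(1, 1, 1, 1, "North"),
    SpanCell(1, 2, 1, 1, "120"),
    SpanCell(2, 1, 1, 1, "South"),
    SpanCell(2, 2, 1, 1, "95"),
]
rows = expand_spans(cells, n_rows=3, n_cols=3)
for row in rows:
    print(row)
```

A grid in this shape can be passed to a DataFrame constructor, while an HTML or spreadsheet export can instead keep the original spans.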

Maintenance & Community

The provided README does not contain specific details regarding maintainers, community channels (like Discord/Slack), or project roadmap.

Licensing & Compatibility

The specific open-source license for img2table is not explicitly stated in the README. Consequently, compatibility for commercial use or integration within closed-source projects remains undetermined.

Limitations & Caveats

Extraction accuracy depends heavily on the quality of the chosen OCR engine. Tables in which OCR finds no text are not returned. The library is optimized for documents with white or light backgrounds; performance on other document types is not guaranteed. For difficult cases where OpenCV-based detection is insufficient, CNN- or LLM-based solutions may be necessary. Enabling the detect_rotation parameter may change image coordinates, so reported table positions may no longer match the original image.
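To see why rotation correction changes coordinates, the sketch below rotates a point about the image center and then maps it back. It assumes the correction is a pure rotation about the center by a known angle with no change in canvas size; these are simplifying assumptions for illustration, and rotate_point and to_original are hypothetical helpers rather than library functions.

```python
import math


def rotate_point(x, y, cx, cy, angle_deg):
    """Rotate (x, y) about (cx, cy) by angle_deg with the standard 2D rotation matrix."""
    theta = math.radians(angle_deg)
    dx, dy = x - cx, y - cy
    return (
        cx + dx * math.cos(theta) - dy * math.sin(theta),
        cy + dx * math.sin(theta) + dy * math.cos(theta),
    )


def to_original(x, y, cx, cy, correction_deg):
    """Undo a correction of correction_deg to recover the pre-correction position."""
    return rotate_point(x, y, cx, cy, -correction_deg)


center = (100.0, 50.0)                          # assumed image center
corrected = rotate_point(140.0, 50.0, *center, 5.0)
restored = to_original(*corrected, *center, 5.0)
print(corrected)
print(restored)
```

Even a 5 degree correction moves a point 40 pixels from the center by a few pixels, which is enough to misalign a bounding box drawn on the original image.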
