img2table by xavctn

Python library for document table extraction

Created 4 years ago
866 stars

Top 40.9% on SourcePulse

Summary

img2table is a Python library that identifies and extracts tables from images and PDF documents. It is a practical, CPU-friendly alternative to more resource-intensive neural-network solutions for users who need efficient tabular data extraction from varied document formats.

How It Works

The library detects table structure with OpenCV image processing rather than a purely deep learning approach. It then integrates with one of several OCR engines (e.g., Tesseract, PaddleOCR, EasyOCR) to read cell content, which keeps the extraction pipeline flexible. This design favors speed and low resource use, which makes it well suited to CPU-bound environments.
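To make the idea of rule-based structure detection concrete, here is a minimal toy sketch in pure Python. It is not img2table's algorithm (the library works on real images with OpenCV); it assumes a tiny binary grid where 1 is an ink pixel, and every name in it (GRID, find_rule_rows, find_rule_cols, cells_from_rules) is hypothetical. It finds rows and columns that are mostly ink, treats them as table rules, and derives the cell boxes between them.

```python
from typing import List, Tuple

Grid = List[List[int]]  # 1 = dark (ink) pixel, 0 = light background

# A 2 x 2 table drawn with one-pixel-thick rules (illustrative data only).
GRID: Grid = [
    [1, 1, 1, 1, 1, 1, 1],
    [1, 0, 0, 1, 0, 0, 1],
    [1, 1, 1, 1, 1, 1, 1],
    [1, 0, 0, 1, 0, 0, 1],
    [1, 1, 1, 1, 1, 1, 1],
]


def find_rule_rows(grid: Grid, min_fraction: float = 0.8) -> List[int]:
    """Return indices of rows in which at least min_fraction of pixels are dark."""
    width = len(grid[0])
    return [y for y, row in enumerate(grid) if sum(row) >= min_fraction * width]


def find_rule_cols(grid: Grid, min_fraction: float = 0.8) -> List[int]:
    """Return indices of columns in which at least min_fraction of pixels are dark."""
    height = len(grid)
    return [
        x
        for x in range(len(grid[0]))
        if sum(row[x] for row in grid) >= min_fraction * height
    ]


def cells_from_rules(rows: List[int], cols: List[int]) -> List[Tuple[int, int, int, int]]:
    """Return (x1, y1, x2, y2) boxes between consecutive rules, skipping zero-area gaps."""
    return [
        (x1, y1, x2, y2)
        for y1, y2 in zip(rows, rows[1:])
        if y2 - y1 > 1
        for x1, x2 in zip(cols, cols[1:])
        if x2 - x1 > 1
    ]
```

Each resulting box is a region that an OCR engine could then be asked to read. Real documents need noise handling, thick or broken lines, and borderless tables, none of which this sketch attempts.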

Quick Start & Requirements

Install with pip: pip install img2table for standard Tesseract support, or pip install img2table[paddle], pip install img2table[easyocr], etc., for alternative OCR backends. The chosen OCR engine (e.g., Tesseract) must be installed and configured separately. PDF pages are converted to images at 200 DPI before processing. Skew and rotation detection is supported for angles up to 45 degrees and is based on a method described by Huang (2020).
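A minimal usage sketch for a single image follows. It assumes img2table and the Tesseract binary are both installed, and it uses the class and parameter names as I understand them from the project's README (Image, TesseractOCR, extract_tables, borderless_tables, min_confidence); check them against the installed version. The wrapper function extract_tables_from_image is hypothetical and exists only for illustration.

```python
def extract_tables_from_image(image_path: str, lang: str = "eng"):
    """Return the tables found in one image, using Tesseract to read cell text."""
    # Imported lazily so this module still loads where img2table is not installed.
    from img2table.document import Image
    from img2table.ocr import TesseractOCR

    ocr = TesseractOCR(n_threads=1, lang=lang)  # requires a working Tesseract install
    doc = Image(image_path)
    # borderless_tables and min_confidence are shown with conservative values;
    # treat them as assumptions to tune, not recommended settings.
    return doc.extract_tables(ocr=ocr, borderless_tables=False, min_confidence=50)
```

Calling extract_tables_from_image on a path of your own should return a list of table objects, one per detected table; an empty list means nothing was detected or OCR found no text.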

Highlighted Details

  • Supports heterogeneous documents, including native PDFs, scanned PDFs, and common image formats.
  • Capable of handling complex table structures, such as merged cells.
  • Offers extensive OCR integration options: Tesseract, PaddleOCR, EasyOCR, docTR, Surya, RapidOCR, Google Vision, AWS Textract, and Azure Cognitive Services.
  • Extracted tables are returned as objects with accessible Pandas DataFrame and HTML representations.
  • Provides direct export functionality for extracted tables to .xlsx files, preserving their original structure.
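The output options listed above can be sketched for a PDF as follows. This assumes the README's documented interface (PDF, extract_tables returning a mapping of page index to tables, the df and html attributes, and to_xlsx); the helper names count_tables and export_pdf_tables are hypothetical, and the whole block is an illustration rather than a reference implementation.

```python
from typing import Dict, List


def count_tables(tables_by_page: Dict[int, List[object]]) -> int:
    """Total number of tables across all pages of a per-page result mapping."""
    return sum(len(tables) for tables in tables_by_page.values())


def export_pdf_tables(pdf_path: str, xlsx_path: str, lang: str = "eng") -> int:
    """Print a short summary of each table in a PDF, then write them all to an .xlsx file."""
    # Imported lazily so this module still loads where img2table is not installed.
    from img2table.document import PDF
    from img2table.ocr import TesseractOCR

    ocr = TesseractOCR(n_threads=1, lang=lang)
    doc = PDF(pdf_path)

    # For PDFs the result is assumed to map page index -> list of extracted tables.
    tables_by_page = doc.extract_tables(ocr=ocr)
    for page_index, tables in tables_by_page.items():
        for table in tables:
            frame = table.df    # pandas DataFrame view of the table
            markup = table.html  # HTML string for the same table
            print(f"page {page_index}: {frame.shape[0]} x {frame.shape[1]} cells, "
                  f"{len(markup)} characters of HTML")

    # to_xlsx performs its own extraction pass before writing the workbook.
    doc.to_xlsx(dest=xlsx_path, ocr=ocr)
    return count_tables(tables_by_page)
```

Keeping count_tables separate from the extraction call makes the bookkeeping easy to check without any document or OCR engine at hand.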

Maintenance & Community

The provided README does not contain specific details regarding maintainers, community channels (like Discord/Slack), or project roadmap.

Licensing & Compatibility

The specific open-source license for img2table is not explicitly stated in the README. Consequently, compatibility for commercial use or integration within closed-source projects remains undetermined.

Limitations & Caveats

Extraction accuracy depends heavily on the quality of the OCR engine in use. Tables for which OCR returns no usable text are not returned. The library is optimized for documents with white or light backgrounds; results on other document types are not guaranteed. Where OpenCV-based detection proves insufficient, a CNN- or LLM-based solution may be necessary. Enabling the detect_rotation parameter may rotate the image, so returned coordinates may no longer match those of the original image.
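The coordinate caveat can be illustrated with plain geometry. The sketch below rotates a point about the image centre, which is a simplification: it is an assumption that a rotation correction behaves this way, it ignores any change in canvas size, and the source does not say whether the library exposes the angle it estimated. The function name rotate_point is hypothetical.

```python
import math
from typing import Tuple


def rotate_point(x: float, y: float, angle_deg: float,
                 width: float, height: float) -> Tuple[float, float]:
    """Rotate (x, y) by angle_deg about the centre of a width x height image."""
    cx, cy = width / 2, height / 2
    theta = math.radians(angle_deg)
    dx, dy = x - cx, y - cy
    return (
        cx + dx * math.cos(theta) - dy * math.sin(theta),
        cy + dx * math.sin(theta) + dy * math.cos(theta),
    )
```

Even a small angle moves points far from the centre by several pixels, which is why bounding boxes computed on a rotation-corrected image should not be overlaid on the original without accounting for the transform.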

Health Check

Last Commit: 2 weeks ago
Responsiveness: Inactive
Pull Requests (30d): 1
Issues (30d): 0

Star History

4 stars in the last 30 days
