xavctn
Python library for document table extraction
Top 40.9% on SourcePulse
Summary
img2table is a Python library designed for identifying and extracting tables from images and PDF documents. It provides a practical, CPU-friendly alternative to more resource-intensive neural network-based solutions, benefiting users who require efficient tabular data extraction from diverse document formats.
How It Works
The library leverages OpenCV for robust table structure detection, differentiating it from purely deep learning approaches. It integrates with multiple OCR engines (e.g., Tesseract, PaddleOCR, EasyOCR) to extract cell content, offering a flexible extraction pipeline. This design prioritizes performance and lower resource utilization, making it particularly suitable for CPU-bound environments.
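To make the two-stage idea concrete, here is a minimal, library-independent sketch of how detected table structure and OCR output can be combined. It is not img2table's actual algorithm: the `Word` class and the `build_cells` and `fill_cells` functions are hypothetical names introduced for illustration, and the sketch assumes the ruling-line positions have already been found (for example by an OpenCV line detector).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """One OCR result: recognized text plus its pixel bounding box."""
    text: str
    x1: int
    y1: int
    x2: int
    y2: int


def build_cells(h_lines, v_lines):
    """Turn separator positions into a grid of (x1, y1, x2, y2) cell boxes."""
    ys = sorted(set(h_lines))  # y positions of horizontal rulings
    xs = sorted(set(v_lines))  # x positions of vertical rulings
    return [
        [(xs[c], ys[r], xs[c + 1], ys[r + 1]) for c in range(len(xs) - 1)]
        for r in range(len(ys) - 1)
    ]


def fill_cells(cells, words):
    """Assign each word to the cell containing its center; empty cells become None."""
    table = []
    for row in cells:
        out_row = []
        for x1, y1, x2, y2 in row:
            inside = [
                w for w in words
                if x1 <= (w.x1 + w.x2) / 2 < x2 and y1 <= (w.y1 + w.y2) / 2 < y2
            ]
            inside.sort(key=lambda w: (w.y1, w.x1))  # rough reading order
            out_row.append(" ".join(w.text for w in inside) or None)
        table.append(out_row)
    return table


# Toy input: a 2x2 table with rulings at known positions and four OCR words.
words = [
    Word("Name", 5, 5, 30, 15),
    Word("Age", 55, 5, 80, 15),
    Word("Ada", 5, 25, 25, 35),
    Word("36", 55, 25, 70, 35),
]
cells = build_cells(h_lines=[0, 20, 40], v_lines=[0, 50, 100])
table = fill_cells(cells, words)
print(table)  # [['Name', 'Age'], ['Ada', '36']]
```

Keeping structure detection separate from text recognition is what lets the OCR engine be swapped without touching the detection step.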
Quick Start & Requirements
Install with pip. The base package, pip install img2table, supports Tesseract; extras such as pip install "img2table[paddle]" or pip install "img2table[easyocr]" add alternative OCR backends. The chosen OCR engine itself (e.g., the Tesseract binary) must be installed and configured separately. PDF pages are converted to images at 200 DPI before processing. Skew and rotation correction is supported for angles up to 45 degrees, based on a method described by Huang (2020).
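Once the package and an OCR engine are installed, a first extraction might look like the sketch below. The class and method names (`Image`, `PDF`, `TesseractOCR`, `extract_tables`) reflect the img2table API as commonly documented, but treat them as assumptions and check them against the installed version. `document_kind` and the `extract_tables` wrapper are hypothetical helpers written for this example.

```python
from pathlib import Path


def document_kind(path):
    """Classify a file as 'pdf' or 'image' from its extension (hypothetical helper)."""
    return "pdf" if Path(path).suffix.lower() == ".pdf" else "image"


def extract_tables(path, lang="eng"):
    """Run table extraction on one file and return whatever img2table yields."""
    # Imported lazily so this module still loads where img2table is not installed.
    from img2table.document import Image, PDF
    from img2table.ocr import TesseractOCR

    # Assumed constructor argument; requires a working Tesseract install.
    ocr = TesseractOCR(lang=lang)

    if document_kind(path) == "pdf":
        doc = PDF(src=path)
    else:
        doc = Image(src=path)

    # Expected result (an assumption): a list of tables for an image,
    # or a mapping of page index to a list of tables for a PDF.
    return doc.extract_tables(ocr=ocr)
```

Calling `extract_tables("scan.png")` would then return the detected tables for that image, provided Tesseract is available on the machine.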
Highlighted Details
.xlsx files, preserving their original structure.
Maintenance & Community
The provided README does not contain specific details regarding maintainers, community channels (like Discord/Slack), or project roadmap.
Licensing & Compatibility
The specific open-source license for img2table is not explicitly stated in the README. Consequently, compatibility for commercial use or integration within closed-source projects remains undetermined.
Limitations & Caveats
Extraction accuracy depends heavily on the quality of the chosen OCR engine. Tables in which no OCR text is detected are not returned. The library is optimized for documents with white or light backgrounds; performance on other document types is not guaranteed. For challenging cases where OpenCV-based detection proves insufficient, alternative CNN or LLM solutions may be necessary. Enabling the detect_rotation parameter may change the image before processing, so reported coordinates may no longer match the original image.