PDF table extraction toolkit for converting tables to multiple formats
This toolkit extracts structured tabular data from PDF documents and is aimed at researchers and developers who need reliable, fast table parsing. It is lightweight, uses a deep learning model to detect tables and recover their structure, and can convert the extracted tables to multiple formats.
How It Works
gmft uses Microsoft's Table Transformer (TATR) model, chosen for its performance and reliability, particularly on tables with implicit structure. For speed, it reads the positional text data already embedded in the PDF, which often avoids the need for OCR. Focusing on table extraction alone, rather than general document parsing, also contributes to its efficiency.
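The idea of reusing a PDF's positional text instead of running OCR can be illustrated with a minimal sketch. This is not gmft's actual implementation: the Word class, the assign_words_to_cells function, and the row_edges/col_edges inputs are hypothetical names chosen for illustration. The sketch assumes that a structure model has already produced sorted row and column boundaries, and that each word comes with a bounding box read from the PDF's text layer.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass
class Word:
    """A word with its bounding box, as a PDF text layer might provide it (hypothetical type)."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def assign_words_to_cells(words, row_edges, col_edges):
    """Place each word in the grid cell that contains its center point.

    row_edges and col_edges are sorted boundary coordinates that include
    the outer borders, so N+1 edges describe N rows or columns.
    """
    n_rows = len(row_edges) - 1
    n_cols = len(col_edges) - 1
    grid = [[[] for _ in range(n_cols)] for _ in range(n_rows)]

    # Visit words top-to-bottom, then left-to-right, so text within a cell reads in order.
    for w in sorted(words, key=lambda w: (w.y0, w.x0)):
        cx = (w.x0 + w.x1) / 2
        cy = (w.y0 + w.y1) / 2
        r = bisect_right(row_edges, cy) - 1
        c = bisect_right(col_edges, cx) - 1
        # Words whose center falls outside the table are ignored.
        if 0 <= r < n_rows and 0 <= c < n_cols:
            grid[r][c].append(w.text)

    return [[" ".join(cell) for cell in row] for row in grid]


# Illustrative input: five words and the boundaries of a 2x2 table.
words = [
    Word("Name", 10, 10, 40, 20),
    Word("Age", 110, 10, 130, 20),
    Word("Ada", 10, 30, 30, 40),
    Word("Lovelace", 35, 30, 80, 40),
    Word("36", 110, 30, 120, 40),
]
table = assign_words_to_cells(words, row_edges=[5, 25, 45], col_edges=[5, 100, 200])
print(table)  # [['Name', 'Age'], ['Ada Lovelace', '36']]
```

Because the text and its coordinates come straight from the PDF, no character recognition is involved in this step; OCR would only be needed for scanned pages that lack a text layer.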
Quick Start & Requirements
pip install gmft
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
gmft may falsely detect columnar text as tables, may struggle with slightly askew tables, and occasionally misses tables (false negatives). Support for multi-indices and spanning cells is experimental.