gmft by conjuncts

PDF table extraction toolkit for converting tables to multiple formats

created 1 year ago
499 stars

Top 63.1% on sourcepulse

Project Summary

gmft extracts structured tabular data from PDF documents. It targets researchers and developers who need robust, efficient table parsing, and it offers a lightweight, fast solution that uses a deep learning model to produce high-quality output in several formats.

How It Works

gmft utilizes Microsoft's Table Transformer (TATR) model, chosen for its superior performance and reliability, particularly on tables with implicit structures. The approach prioritizes speed by leveraging existing positional text data within PDFs, often bypassing the need for OCR. This focus on table extraction, rather than general document parsing, contributes to its efficiency.
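The idea of reusing a PDF's positional text instead of running OCR can be illustrated with a small sketch. This is not gmft's actual implementation: the function names, the word tuples, and the row and column bands below are all hypothetical, and the bands stand in for what a structure model such as TATR would predict. Each word already carries a bounding box, so filling the table reduces to checking which row band and column band contain the word's center.

```python
def band_index(bands, value):
    """Return the index of the (low, high) band containing value, or None."""
    for i, (low, high) in enumerate(bands):
        if low <= value < high:
            return i
    return None


def fill_grid(words, rows, cols):
    """Build a grid of cell strings from positioned words.

    words: (text, x0, y0, x1, y1) tuples, as a PDF text layer might provide.
    rows:  (top, bottom) bands; cols: (left, right) bands (hypothetical inputs).
    """
    grid = [[[] for _ in cols] for _ in rows]
    # Approximate reading order: top to bottom, then left to right.
    for text, x0, y0, x1, y1 in sorted(words, key=lambda w: (w[2], w[1])):
        r = band_index(rows, (y0 + y1) / 2)
        c = band_index(cols, (x0 + x1) / 2)
        if r is not None and c is not None:  # words outside the table are dropped
            grid[r][c].append(text)
    return [[" ".join(cell) for cell in row] for row in grid]


# Made-up sample data for illustration only.
words = [
    ("Name", 10, 10, 40, 20), ("Score", 110, 10, 145, 20),
    ("Ada", 10, 30, 30, 40), ("Lovelace", 34, 30, 80, 40), ("97", 110, 30, 122, 40),
]
rows = [(5, 25), (25, 45)]
cols = [(0, 100), (100, 200)]
table = fill_grid(words, rows, cols)
```

Because no pixels are decoded, the cost of this step is a simple scan over the words on the page, which is consistent with the speed argument made above.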

Quick Start & Requirements

  • Install: pip install gmft
  • Prerequisites: PyTorch (CPU or GPU build). The TATR model weights (~270MB) are downloaded on first run.
  • Documentation: readthedocs
  • Demo: demo notebook
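After installation, a first extraction might look like the sketch below. It follows the usage pattern shown in the project's README as best recalled here; the class names (AutoTableDetector, AutoTableFormatter, PyPDFium2Document) and the df() method should be treated as assumptions and checked against the current documentation, since they may differ between versions. The helper name extract_tables and the file path are placeholders.

```python
def extract_tables(pdf_path):
    """Return one pandas DataFrame per table detected in the PDF.

    Assumes the gmft API as shown in its README; verify against the docs.
    """
    # Imported lazily so this module loads even where gmft is not installed.
    from gmft.auto import AutoTableDetector, AutoTableFormatter
    from gmft.pdf_bindings import PyPDFium2Document

    detector = AutoTableDetector()    # finds table regions on each page
    formatter = AutoTableFormatter()  # recovers rows and columns with TATR

    doc = PyPDFium2Document(pdf_path)
    try:
        frames = []
        for page in doc:
            for cropped in detector.extract(page):
                # Convert while the document is still open.
                frames.append(formatter.extract(cropped).df())
        return frames
    finally:
        doc.close()


# Usage (placeholder path):
#   frames = extract_tables("path/to/file.pdf")
```

The first call is expected to be slower than later ones because the model weights are downloaded and loaded at that point.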

Highlighted Details

  • Leverages Microsoft's Table Transformer (TATR) for high extraction quality.
  • Achieves ~1.381 s/page on CPU, significantly faster than alternatives like unstructured or nougat.
  • Supports export to Pandas DataFrame, Markdown, LaTeX, CSV, JSON, cropped images, and table captions.
  • Minimal dependencies, avoiding complex installations like detectron2 or tesseract.
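To make the export formats listed above concrete, here is a standard-library sketch that renders one small table as Markdown, CSV, and JSON. It is independent of gmft's own export code; the helper names and the sample data are hypothetical.

```python
import csv
import io
import json


def to_markdown(header, rows):
    """Render a pipe-delimited Markdown table."""
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in rows]
    return "\n".join(lines)


def to_csv(header, rows):
    """Render CSV text; the csv module handles quoting of commas and quotes."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()


def to_json(header, rows):
    """Render a JSON array with one object per table row."""
    return json.dumps([dict(zip(header, row)) for row in rows])


# Made-up sample table for illustration only.
header = ["Name", "Score"]
rows = [["Ada Lovelace", "97"], ["Alan Turing", "95"]]
markdown_text = to_markdown(header, rows)
csv_text = to_csv(header, rows)
json_text = to_json(header, rows)
```

All three outputs derive from the same header and row lists, which is why a single intermediate representation such as a DataFrame is a convenient starting point for multiple exports.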

Maintenance & Community

  • Acknowledges the PubTables-1M authors and the Hugging Face team's port of TATR.
  • Links to comparison notebooks for use cases.

Licensing & Compatibility

  • MIT License.
  • PyMuPDF support lives in a separate repository because PyMuPDF is licensed under AGPL 3.0. The MIT-licensed core remains compatible with commercial use.

Limitations & Caveats

gmft may falsely detect columnar text as tables, struggle with slightly askew tables, and occasionally miss tables altogether (false negatives). Support for multi-indices and spanning cells is experimental.

Health Check

  • Last commit: 5 days ago
  • Responsiveness: 1 day
  • Pull requests (30d): 1
  • Issues (30d): 3
  • Star history: 43 stars in the last 90 days
