gmft by conjuncts

PDF table extraction toolkit for converting tables to multiple formats

Created 1 year ago

521 stars

Top 60.3% on SourcePulse

Project Summary

gmft extracts structured tabular data from PDF documents and is aimed at researchers and developers who need reliable, fast table parsing. It is a lightweight library that uses a deep learning model to detect and parse tables, then exports them to several output formats.

How It Works

gmft uses Microsoft's Table Transformer (TATR) model, chosen for its performance and reliability, particularly on tables with implicit structure. For speed, it reads the positional text data already embedded in the PDF, which often removes the need for OCR. Its narrow focus on table extraction, rather than general document parsing, also contributes to its efficiency.
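The idea of reusing positional text can be illustrated with a minimal sketch. It assumes that a table model has already produced row and column boundaries and that the PDF supplies a bounding box for each word; the `Word` class, `band_index`, and `words_to_grid` are hypothetical names for illustration only and are not gmft's actual implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """A word with its bounding box, as a PDF text layer might provide it (hypothetical)."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def band_index(center, boundaries):
    """Return i such that boundaries[i] <= center < boundaries[i + 1], else None."""
    for i in range(len(boundaries) - 1):
        if boundaries[i] <= center < boundaries[i + 1]:
            return i
    return None


def words_to_grid(words, row_bounds, col_bounds):
    """Place each word into the cell that contains the center of its box."""
    n_rows = len(row_bounds) - 1
    n_cols = len(col_bounds) - 1
    cells = [[[] for _ in range(n_cols)] for _ in range(n_rows)]
    # Visit words top-to-bottom, then left-to-right, so cell text reads naturally.
    for w in sorted(words, key=lambda w: (w.y0, w.x0)):
        r = band_index((w.y0 + w.y1) / 2, row_bounds)
        c = band_index((w.x0 + w.x1) / 2, col_bounds)
        if r is not None and c is not None:
            cells[r][c].append(w.text)
    return [[" ".join(cell) for cell in row] for row in cells]


words = [
    Word("Name", 10, 10, 40, 20),
    Word("Score", 110, 10, 140, 20),
    Word("Ada", 10, 30, 30, 40),
    Word("Lovelace", 35, 30, 80, 40),
    Word("97", 110, 30, 125, 40),
]
grid = words_to_grid(words, row_bounds=[0, 25, 50], col_bounds=[0, 100, 200])
print(grid)
```

No pixels are read at any point: the text comes straight from the word list, which is why a PDF with an embedded text layer can skip OCR.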

Quick Start & Requirements

Install: pip install gmft
Prerequisites: PyTorch (CPU or GPU build). The TATR model (~270 MB) is downloaded on first run.
Documentation: readthedocs
Demo: demo notebook
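A typical session might look like the sketch below. It follows the usage pattern shown in the project's README at the time of writing (`AutoTableDetector`, `AutoTableFormatter`, `PyPDFium2Document`); the helper name `extract_tables` is an assumption for illustration, and the API may differ between versions, so check the documentation for the release you install.

```python
def extract_tables(pdf_path):
    """Return one pandas DataFrame per table detected in the PDF (hypothetical helper)."""
    # Imported inside the function so that defining this helper does not load PyTorch.
    from gmft.auto import AutoTableDetector, AutoTableFormatter
    from gmft.pdf_bindings import PyPDFium2Document

    detector = AutoTableDetector()    # first use downloads the TATR weights
    formatter = AutoTableFormatter()

    doc = PyPDFium2Document(pdf_path)
    try:
        frames = []
        for page in doc:
            # Detect cropped tables on the page, then recover their cell structure.
            for table in detector.extract(page):
                frames.append(formatter.extract(table).df())
        return frames
    finally:
        # Close the document only after all DataFrames have been built.
        doc.close()
```

Calling `extract_tables("path/to/file.pdf")` would then return a list of DataFrames, one per detected table.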

Highlighted Details

Leverages Microsoft's Table Transformer (TATR) for high extraction quality.
Runs at about 1.381 s/page on CPU, which the project reports as significantly faster than alternatives such as unstructured or nougat.
Supports export to Pandas DataFrame, Markdown, LaTeX, CSV, JSON, cropped images, and table captions.
Minimal dependencies, avoiding complex installations like detectron2 or tesseract.
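To make the export targets in the list above concrete, here is a small standard-library sketch that renders the same table as CSV, JSON, and Markdown. These functions are illustrative stand-ins, not gmft's exporters; in practice a table returned as a Pandas DataFrame would more likely be written out with pandas' own `to_csv`, `to_json`, `to_markdown`, or `to_latex` methods.

```python
import csv
import io
import json

header = ["Name", "Score"]
rows = [["Ada Lovelace", "97"], ["Alan Turing", "95"]]


def to_csv(header, rows):
    """Serialize the table as CSV text with Unix line endings."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()


def to_json(header, rows):
    """Serialize the table as a JSON array of row objects keyed by column name."""
    return json.dumps([dict(zip(header, row)) for row in rows])


def to_markdown(header, rows):
    """Serialize the table as a pipe-delimited Markdown table."""
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in rows]
    return "\n".join(lines)


print(to_csv(header, rows))
print(to_json(header, rows))
print(to_markdown(header, rows))
```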

Maintenance & Community

MIT licensed.
Acknowledges the PubTables-1M authors and the Hugging Face port of TATR.
Links to comparison notebooks for use cases.

Licensing & Compatibility

MIT License.
PyMuPDF support lives in a separate repository because PyMuPDF is licensed under AGPL 3.0. The MIT-licensed core is compatible with commercial use.

Limitations & Caveats

gmft may falsely detect columnar text as tables, struggle with slightly skewed tables, and occasionally miss tables altogether (false negatives). Support for multi-indices and spanning cells is experimental.
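Detection models such as TATR attach a confidence score to each candidate table, so one common way to trade false positives against false negatives is a score threshold. The sketch below shows that trade-off with a hypothetical `Detection` record and `filter_detections` helper; the names and the scores are illustrative assumptions, not gmft's API or measured output.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    """A candidate table with the page it was found on and a model confidence (hypothetical)."""
    page: int
    confidence: float


def filter_detections(detections, min_confidence):
    """Keep only candidates whose confidence meets the threshold."""
    return [d for d in detections if d.confidence >= min_confidence]


detections = [Detection(1, 0.99), Detection(2, 0.62), Detection(3, 0.91)]

# A strict threshold drops doubtful candidates (fewer false positives, more misses).
strict = filter_detections(detections, 0.9)
# A lenient threshold keeps them (fewer misses, more false positives).
lenient = filter_detections(detections, 0.5)

print([d.page for d in strict])
print([d.page for d in lenient])
```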
