tabled by VikParuchuri

Table extraction library (deprecated, functionality moved to `marker`)

Created 1 year ago

754 stars

Top 46.1% on SourcePulse

1 Expert Loves This Project

bryanhelmig

Cofounder of Zapier

Project Summary

This library extracts tables from PDFs and images into Markdown, CSV, or HTML formats. It's designed for researchers and developers needing to process tabular data embedded in documents, offering automated detection, layout analysis, and cell formatting.

How It Works

Tabled leverages the Surya library for initial table detection within documents. It then employs a layout analysis model to identify rows and columns, followed by a recognition model to extract and format cell content. This multi-stage approach aims for high accuracy in parsing complex table structures.
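The last stage described above, turning recognized cells into an output format, can be sketched with the Python standard library alone. This is an illustrative sketch, not tabled's own code: the Cell structure and the cells_to_grid, to_markdown, to_csv, and to_html helpers are hypothetical names, and it assumes each cell already carries a row index, a column index, and its text.

```python
import csv
import html
import io
from dataclasses import dataclass


@dataclass
class Cell:
    """One recognized table cell (hypothetical structure, not tabled's own)."""
    row: int
    col: int
    text: str


def cells_to_grid(cells):
    """Arrange cells into a rectangular list of rows, padding gaps with ''."""
    if not cells:
        return []
    n_rows = max(c.row for c in cells) + 1
    n_cols = max(c.col for c in cells) + 1
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for c in cells:
        grid[c.row][c.col] = c.text.strip()
    return grid


def to_markdown(grid):
    """Render the grid as a pipe table, treating the first row as the header."""
    if not grid:
        return ""

    def line(row):
        # Escape literal pipes so they do not split a cell.
        return "| " + " | ".join(v.replace("|", "\\|") for v in row) + " |"

    header, *body = grid
    sep = "| " + " | ".join("---" for _ in header) + " |"
    return "\n".join([line(header), sep, *(line(r) for r in body)])


def to_csv(grid):
    """Render the grid as CSV; the csv module quotes fields that need it."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(grid)
    return buf.getvalue()


def to_html(grid):
    """Render the grid as an HTML table, using <th> for the first row."""
    rows = []
    for i, row in enumerate(grid):
        tag = "th" if i == 0 else "td"
        cells = "".join(f"<{tag}>{html.escape(v)}</{tag}>" for v in row)
        rows.append(f"<tr>{cells}</tr>")
    return "<table>" + "".join(rows) + "</table>"


cells = [
    Cell(0, 0, "Item"), Cell(0, 1, "Price"),
    Cell(1, 0, "Tea, green"), Cell(1, 1, "3.50"),
    Cell(2, 0, "Coffee"),  # the price cell is missing, so it is left empty
]
grid = cells_to_grid(cells)
print(to_markdown(grid))
print(to_csv(grid))
print(to_html(grid))
```

Building one intermediate grid first keeps the three writers independent of how the cells were detected, and padding missing cells keeps every row the same width in all three formats.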

Quick Start & Requirements

Install with: pip install tabled-pdf
Requires Python 3.10+ and PyTorch.
Model weights download automatically on first run.
Official documentation: https://github.com/VikParuchuri/tabled

Highlighted Details

Achieves a 0.847 alignment score against GPT-4 table predictions.
Processes tables at an average of 0.029 seconds per table on an A10G GPU.
Supports PDF, image, Word, and PowerPoint inputs.
Offers a Streamlit GUI for interactive use.
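To put the per-table timing in the list above into perspective, the arithmetic below converts 0.029 seconds per table into a throughput and a batch estimate. It is a back-of-the-envelope sketch: it assumes the quoted average holds for your tables and hardware and ignores model loading and file I/O; the function names are illustrative only.

```python
# Figure quoted in the list above: average seconds per table on an A10G GPU.
SECONDS_PER_TABLE = 0.029


def tables_per_second(seconds_per_table: float) -> float:
    """Throughput is the reciprocal of the per-table latency."""
    return 1.0 / seconds_per_table


def estimated_seconds(n_tables: int, seconds_per_table: float = SECONDS_PER_TABLE) -> float:
    """Total time for a batch, assuming the average rate stays constant."""
    return n_tables * seconds_per_table


print(f"{tables_per_second(SECONDS_PER_TABLE):.1f} tables/s")
print(f"{estimated_seconds(10_000) / 60:.1f} minutes for 10,000 tables")
```

Under these assumptions the rate works out to roughly 34 tables per second, or just under five minutes for 10,000 tables.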

Maintenance & Community

The project is deprecated, with functionality migrated to marker.
Community discussions are hosted on Discord.

Licensing & Compatibility

Model weights are licensed under CC-BY-NC-SA-4.0.
Commercial use is permitted for organizations under $5M USD in both revenue and VC funding; other organizations can use the dual-licensing option to obtain a commercial license.

Limitations & Caveats

The project is officially deprecated. Users are advised to migrate to marker, where development and support continue.

Health Check

Last Commit: 1 year ago
Responsiveness: Inactive
Pull Requests (30d): 0
Issues (30d): 0

Star History

1 star in the last 30 days

Explore Similar Projects

gmft by conjuncts

PDF table extraction toolkit for converting tables to multiple formats

Created 1 year ago

Updated 3 days ago

vision-parse by iamarunbrahma

CLI tool for parsing PDFs into markdown using vision LLMs

Created 1 year ago

Updated 4 months ago

DeepSeek-OCR-Web by fufankeji

Multimodal document parsing studio for PDFs and images

Created 4 months ago

Updated 4 months ago

Starred by Dharmesh Shah (Cofounder of HubSpot).

thepipe by emcf

SDK for extracting data from documents

Created 1 year ago

Updated 4 months ago

knowledge-table by whyhow-ai

Open-source package for structured data extraction from unstructured documents

Created 1 year ago

Updated 1 year ago

Starred by Jonathan Ragan-Kelley (Professor at MIT), Clément Renault (Cofounder of Meilisearch), and 1 more.

qsv by dathere

CLI tool for blazing-fast CSV data-wrangling

Created 5 years ago

Updated 1 day ago

docstrange by NanoNets

Extract and convert data from any document to multiple formats

Created 6 months ago

Updated 3 months ago

OmniDocBench by opendatalab

Benchmark for document parsing and evaluation in real-world scenarios

Created 1 year ago

Updated 2 months ago

Starred by Jeff Hammerbacher (Cofounder of Cloudera), Rodrigo Nader (Cofounder of Langflow), and 2 more.

llmsherpa by nlmatics

Developer APIs for LLM project acceleration

Created 2 years ago

Updated 1 year ago

Starred by Jeremy Howard (Cofounder of fast.ai) and Tim Suchanek (Founder of expand.ai).

nlm-ingestor by nlmatics

Server for LLM ingestion via API, enabling custom RAG parsing

Created 2 years ago

Updated 11 months ago

pdf-craft by oomol-lab

CLI tool for converting PDFs, especially scanned books, into other formats

Created 1 year ago

Updated 1 week ago

Starred by Luis Capelo (Cofounder of Lightning AI), Ben Firshman (Cofounder of Replicate), and 17 more.

marker by datalab-to

CLI tool for converting PDFs and other documents to Markdown, JSON, and HTML

Created 2 years ago

Updated 2 weeks ago
