PDF-Extract-Kit by opendatalab

PDF toolkit for high-quality content extraction

Created 1 year ago

9,391 stars

Top 5.5% on SourcePulse

1 Expert Loves This Project

Chip Huyen

Author of "AI Engineering", "Designing Machine Learning Systems"

Project Summary

This toolkit provides a modular, high-quality solution for extracting structured content from diverse PDF documents, targeting developers and researchers building document intelligence applications. It integrates state-of-the-art models for layout detection, formula recognition, OCR, and table parsing, enabling robust content extraction across various document types.

How It Works

PDF-Extract-Kit uses a modular architecture that lets users combine and configure different models for specific extraction tasks. Its state-of-the-art models (e.g., YOLO, LayoutLMv3, UniMERNet, StructEqTable) are fine-tuned on diverse datasets to stay accurate and robust on complex documents such as research papers, textbooks, and financial reports. Training on varied document types is meant to address the limitations of models trained solely on academic datasets.
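One way to picture the modular design described above is a small task registry driven by a config. This is a minimal sketch under stated assumptions, not the toolkit's actual API: `TASK_REGISTRY`, `register_task`, `build_pipeline`, and the placeholder model wrappers are all hypothetical names chosen for illustration.

```python
from typing import Callable, Dict, List

# Hypothetical registry: maps a task name to a factory that builds its runner.
TASK_REGISTRY: Dict[str, Callable[[dict], Callable[[str], dict]]] = {}


def register_task(name: str):
    """Decorator that records a task factory under the given name."""
    def decorator(factory):
        TASK_REGISTRY[name] = factory
        return factory
    return decorator


@register_task("layout_detection")
def build_layout_detector(cfg: dict):
    # Placeholder standing in for a real detector (e.g., a YOLO-based model).
    def run(page: str) -> dict:
        return {"task": "layout_detection", "model": cfg["model"], "page": page}
    return run


@register_task("formula_recognition")
def build_formula_recognizer(cfg: dict):
    # Placeholder standing in for a real recognizer (e.g., UniMERNet).
    def run(page: str) -> dict:
        return {"task": "formula_recognition", "model": cfg["model"], "page": page}
    return run


def build_pipeline(config: Dict[str, dict]) -> List[Callable[[str], dict]]:
    """Instantiate only the tasks named in the config, in config order."""
    return [TASK_REGISTRY[name](cfg) for name, cfg in config.items()]


# Illustrative config: which tasks to run and which model each one uses.
config = {
    "layout_detection": {"model": "yolo"},
    "formula_recognition": {"model": "unimernet"},
}
pipeline = build_pipeline(config)
results = [step("page_1.png") for step in pipeline]
```

In this sketch, adding or dropping an extraction step only means editing `config`; the pipeline code itself does not change.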

Quick Start & Requirements

Install: Create a conda environment, then run pip install -r requirements.txt (or pip install -r requirements-cpu.txt on CPU-only machines).
Prerequisites: Python 3.10. A GPU is recommended for best performance. Some models need additional packages (e.g., pip3 install doclayout-yolo==0.0.2 --extra-index-url=https://pypi.org/simple). Model weights must be downloaded separately.
Links: Tutorial, Hugging Face Models.
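The requirements above could be checked before installing with a short script. This is a rough sketch, assuming the two requirements files named above sit in the current directory; the helper names (`python_matches`, `requirements_file`, `install_command`) and the `has_gpu` flag are hypothetical and supplied by the user, not detected automatically.

```python
import sys
from typing import List, Tuple

REQUIRED_PYTHON = (3, 10)  # version stated in the prerequisites


def python_matches(version: Tuple[int, int] = sys.version_info[:2]) -> bool:
    """Return True when the interpreter's major.minor matches the stated version."""
    return tuple(version) == REQUIRED_PYTHON


def requirements_file(has_gpu: bool) -> str:
    """Pick the requirements file: the default for GPU machines, the CPU variant otherwise."""
    return "requirements.txt" if has_gpu else "requirements-cpu.txt"


def install_command(has_gpu: bool) -> List[str]:
    """Build (but do not run) the pip command for the chosen requirements file."""
    return [sys.executable, "-m", "pip", "install", "-r", requirements_file(has_gpu)]


if __name__ == "__main__":
    if not python_matches():
        print("Warning: this interpreter is not Python 3.10")
    # Assumption: the user knows whether a GPU is present and sets this flag by hand.
    print(" ".join(install_command(has_gpu=False)))
```

The script only prints the command, so it can be reviewed before anything is installed.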

Highlighted Details

Integrates models for layout detection (DocLayout-YOLO, YOLO-v10, LayoutLMv3), formula detection (YOLOv8), formula recognition (UniMERNet), OCR (PaddleOCR), and table recognition (StructEqTable, StructTable-InternVL2-1B).
Fine-tuned models demonstrate high robustness on diverse document types, including papers, textbooks, and financial reports, handling challenges like blurring and watermarks.
Modular design lets applications be assembled from configuration files with minimal code.
Includes comprehensive evaluation benchmarks for model selection.
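To illustrate the configuration-driven approach in the highlights, the sketch below validates a task-to-model config against the model names mentioned on this page. The config shape, the task keys, and `validate_config` are assumptions made for illustration; the toolkit's real configuration files may be structured differently.

```python
from typing import Dict, List

# Task-to-model options, using model names that appear on this page.
# The snake_case task keys are hypothetical.
SUPPORTED_MODELS: Dict[str, List[str]] = {
    "layout_detection": ["DocLayout-YOLO", "YOLO-v10", "LayoutLMv3"],
    "formula_detection": ["YOLOv8"],
    "formula_recognition": ["UniMERNet"],
    "ocr": ["PaddleOCR"],
    "table_recognition": ["StructEqTable", "StructTable-InternVL2-1B"],
}


def validate_config(config: Dict[str, str]) -> List[str]:
    """Return a list of problems; an empty list means the config is valid."""
    problems = []
    for task, model in config.items():
        if task not in SUPPORTED_MODELS:
            problems.append(f"unknown task: {task}")
        elif model not in SUPPORTED_MODELS[task]:
            problems.append(f"unsupported model for {task}: {model}")
    return problems


good = {"layout_detection": "DocLayout-YOLO", "ocr": "PaddleOCR"}
bad = {"layout_detection": "PaddleOCR", "summarization": "UniMERNet"}

print(validate_config(good))
print(validate_config(bad))
```

Checking the config up front means a mistyped task or a model assigned to the wrong task fails fast, before any weights are loaded.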

Maintenance & Community

The most recent updates added the StructTable-InternVL2-1B table recognition model and the DocLayout-YOLO layout detection model; the health check below shows the last commit was about a year ago. Community contributions are welcome, and a Discord server is available for community engagement.

Licensing & Compatibility

Licensed under AGPL-3.0, because the project depends on YOLO and PyMuPDF, which are themselves distributed under AGPL-3.0. Derivative works and applications that link against the toolkit must generally also be released under AGPL-3.0, so commercial use or integration with closed-source projects may be restricted by the license's copyleft terms.

Limitations & Caveats

The project is focused on content extraction and does not include functionality for reconstructing extracted content into new document formats (e.g., PDF to Markdown); for this, the MinerU project is recommended. Future features like reading order sorting are listed as "Coming Soon!".

Health Check

Last Commit: 1 year ago

Responsiveness: Inactive

Star History

252 stars in the last 30 days