PDF-Extract-Kit by opendatalab

PDF toolkit for high-quality content extraction

created 1 year ago
8,263 stars

Top 6.4% on sourcepulse

Project Summary

PDF-Extract-Kit is a modular toolkit for extracting structured content from diverse PDF documents, aimed at developers and researchers building document intelligence applications. It integrates state-of-the-art models for layout detection, formula detection and recognition, OCR, and table parsing, which can be combined to extract content from many document types.

How It Works

PDF-Extract-Kit uses a modular architecture in which each extraction task is handled by a separately configurable model, so users can combine only the models a given task needs. The models (e.g., YOLO, LayoutLMv3, UniMERNet, StructEqTable) are fine-tuned on diverse datasets to improve accuracy and robustness on complex documents such as research papers, textbooks, and financial reports. This fine-tuning addresses the limitations of models trained solely on academic datasets.
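
The modular, configuration-driven idea described above can be illustrated with a minimal sketch. This is not the toolkit's actual API: the registry, the stub model names (layout_stub, ocr_stub), and the config keys (steps, model, params) are all hypothetical, and the stubs return placeholder dictionaries rather than running real models.

```python
from typing import Callable, Dict, List

# An "extractor" here is any callable that takes a page identifier and
# returns a result dictionary. All names below are hypothetical.
Extractor = Callable[[str], dict]
_REGISTRY: Dict[str, Callable[[dict], Extractor]] = {}


def register(name: str):
    """Decorator that records a model factory under a config-visible name."""
    def wrap(factory):
        _REGISTRY[name] = factory
        return factory
    return wrap


@register("layout_stub")
def make_layout(params: dict) -> Extractor:
    threshold = params.get("conf_thres", 0.25)  # assumed default, illustration only

    def run(page: str) -> dict:
        return {"task": "layout_detection", "page": page, "conf_thres": threshold}
    return run


@register("ocr_stub")
def make_ocr(params: dict) -> Extractor:
    lang = params.get("lang", "en")

    def run(page: str) -> dict:
        return {"task": "ocr", "page": page, "lang": lang}
    return run


def build_pipeline(config: dict) -> List[Extractor]:
    """Instantiate one extractor per configured step, in order."""
    return [_REGISTRY[step["model"]](step.get("params", {}))
            for step in config["steps"]]


def run_pipeline(config: dict, page: str) -> List[dict]:
    """Run every configured extractor on the same page."""
    return [extract(page) for extract in build_pipeline(config)]


# A config like this would normally live in a file; a dict keeps the sketch
# self-contained.
CONFIG = {
    "steps": [
        {"model": "layout_stub", "params": {"conf_thres": 0.5}},
        {"model": "ocr_stub"},
    ]
}
```

Calling run_pipeline(CONFIG, "page-1") returns one result per configured step. Swapping a model then means editing the config rather than the calling code, which is the property the modular design aims for.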

Quick Start & Requirements

  • Install: Create a conda environment, then run pip install -r requirements.txt (or pip install -r requirements-cpu.txt for CPU-only machines).
  • Prerequisites: Python 3.10. GPU support is recommended for optimal performance. Specific models may require additional installations (e.g., pip3 install doclayout-yolo==0.0.2 --extra-index-url=https://pypi.org/simple). Model weights must be downloaded separately.
  • Links: Tutorial, Hugging Face Models.
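
As a rough illustration of the install choices listed above, the following sketch checks the Python version and picks the matching requirements file. The helper names are hypothetical and not part of the toolkit; only the file names requirements.txt and requirements-cpu.txt and the Python 3.10 prerequisite come from the text above.

```python
import sys
from typing import List, Sequence


def is_supported_python(version_info: Sequence[int] = sys.version_info) -> bool:
    """Return True when the interpreter is Python 3.10, the stated prerequisite."""
    return tuple(version_info[:2]) == (3, 10)


def pick_requirements(has_gpu: bool) -> str:
    """Choose the requirements file: the default one for GPU, the CPU-only one otherwise."""
    return "requirements.txt" if has_gpu else "requirements-cpu.txt"


def install_command(has_gpu: bool) -> List[str]:
    """Build (but do not run) the pip command for the current interpreter."""
    return [sys.executable, "-m", "pip", "install", "-r", pick_requirements(has_gpu)]
```

The command list can be passed to subprocess.run from inside the activated conda environment. Model weights and any model-specific packages still need to be installed separately, as noted above.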

Highlighted Details

  • Integrates models for layout detection (DocLayout-YOLO, YOLO-v10, LayoutLMv3), formula detection (YOLOv8), formula recognition (UniMERNet), OCR (PaddleOCR), and table recognition (StructEqTable, including the StructTable-InternVL2-1B model).
  • Fine-tuned models demonstrate high robustness on diverse document types, including papers, textbooks, and financial reports, handling challenges like blurring and watermarks.
  • Offers modularity for flexible application construction via configuration files and minimal code.
  • Includes comprehensive evaluation benchmarks for model selection.

Maintenance & Community

The project is actively developed, with recent updates including the StructTable-InternVL2-1B table recognition model and the DocLayout-YOLO layout detection model. Community contributions are welcomed. A Discord server is available for community engagement.

Licensing & Compatibility

Licensed under AGPL-3.0. The project adopts this license because it depends on components such as YOLO and PyMuPDF, which are themselves AGPL-licensed. Under AGPL-3.0, derivative works, including applications that link against the code or offer it as a network service, must make their source available under the same license, so use in commercial or closed-source projects may be restricted.

Limitations & Caveats

The project focuses on content extraction and does not reconstruct extracted content into new document formats (e.g., PDF to Markdown); for that, the MinerU project is recommended. Some planned features, such as reading-order sorting, are still listed as "Coming Soon!".

Health Check

  • Last commit: 7 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 1
  • Issues (30d): 3

Star History

778 stars in the last 90 days
