PDF-Extract-Kit by opendatalab

PDF toolkit for high-quality content extraction

Created 1 year ago
8,632 stars

Top 6.0% on SourcePulse

Project Summary

This toolkit provides a modular, high-quality solution for extracting structured content from diverse PDF documents, targeting developers and researchers building document intelligence applications. It integrates state-of-the-art models for layout detection, formula recognition, OCR, and table parsing, enabling robust content extraction across various document types.

How It Works

PDF-Extract-Kit leverages a modular architecture, allowing users to combine and configure different models for specific extraction tasks. It fine-tunes state-of-the-art models (e.g., YOLO, LayoutLMv3, UniMERNet, StructEqTable) on diverse datasets to achieve high accuracy and robustness on complex documents like research papers, textbooks, and financial reports. This fine-tuning approach addresses the limitations of models trained solely on academic datasets.
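The modular design described above can be pictured as a registry that maps task names to interchangeable components, with a small configuration choosing which ones run. The following is a minimal sketch of that pattern only: `TASK_REGISTRY`, `register_task`, `run_pipeline`, and the stub tasks are hypothetical names for illustration and are not PDF-Extract-Kit's actual API.

```python
from typing import Callable, Dict, List

# Hypothetical registry: maps a task name to a callable that processes a page.
TASK_REGISTRY: Dict[str, Callable[[dict], dict]] = {}


def register_task(name: str):
    """Decorator that records a task function under the given name."""
    def decorator(func):
        TASK_REGISTRY[name] = func
        return func
    return decorator


@register_task("layout_detection")
def detect_layout(page: dict) -> dict:
    # Stub: a real component would run a layout model here.
    return {**page, "layout": ["title", "text"]}


@register_task("ocr")
def run_ocr(page: dict) -> dict:
    # Stub: a real component would run an OCR model here.
    return {**page, "text": "placeholder"}


def run_pipeline(task_names: List[str], page: dict) -> dict:
    """Apply the configured tasks in order, failing fast on unknown names."""
    for name in task_names:
        if name not in TASK_REGISTRY:
            raise KeyError(f"unknown task: {name}")
        page = TASK_REGISTRY[name](page)
    return page


# Illustrative configuration; in practice the task list would come from a config file.
config = {"tasks": ["layout_detection", "ocr"]}
result = run_pipeline(config["tasks"], {"page_number": 1})
```

Keeping each task behind a name means a different model can be swapped in by editing the configuration rather than the calling code, which is the property the summary attributes to the toolkit.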

Quick Start & Requirements

  • Install: Create a conda environment, then run pip install -r requirements.txt (use requirements-cpu.txt instead on CPU-only machines).
  • Prerequisites: Python 3.10. GPU support is recommended for optimal performance. Specific models may require additional installations (e.g., pip3 install doclayout-yolo==0.0.2 --extra-index-url=https://pypi.org/simple). Model weights must be downloaded separately.
  • Links: Tutorial, Hugging Face Models.
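As a rough illustration of the requirements above, a preflight script could confirm the interpreter version and choose between the two requirements files. This is a sketch under assumptions: `is_supported_python`, `pick_requirements_file`, and `install_command` are hypothetical helpers, not part of the toolkit, and the check treats Python 3.10 as the only supported version because that is the one stated here.

```python
import sys
from typing import Optional, Tuple


def is_supported_python(version: Optional[Tuple[int, int]] = None) -> bool:
    """Return True when the interpreter matches the stated prerequisite (3.10)."""
    major_minor = version if version is not None else tuple(sys.version_info[:2])
    return major_minor == (3, 10)


def pick_requirements_file(has_gpu: bool) -> str:
    """Choose between the two requirements files named in the install step."""
    return "requirements.txt" if has_gpu else "requirements-cpu.txt"


def install_command(has_gpu: bool) -> str:
    """Build the pip command to run inside the conda environment."""
    return f"pip install -r {pick_requirements_file(has_gpu)}"
```

For example, `install_command(has_gpu=False)` yields the CPU-only install line. Model weights and any model-specific packages still need to be fetched separately, as noted above.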

Highlighted Details

  • Integrates models for layout detection (DocLayout-YOLO, YOLO-v10, LayoutLMv3), formula detection (YOLOv8), formula recognition (UniMERNet), OCR (PaddleOCR), and table recognition (StructEqTable, InternVL2-1B).
  • Fine-tuned models demonstrate high robustness on diverse document types, including papers, textbooks, and financial reports, handling challenges like blurring and watermarks.
  • Offers modularity for flexible application construction via configuration files and minimal code.
  • Includes comprehensive evaluation benchmarks for model selection.
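The model lineup in the first bullet can be read as a task-to-model catalogue. The sketch below encodes it as a plain dictionary and validates a user's selection against it. The model names are taken from the list above, while `SUPPORTED_MODELS`, the task keys, and `validate_selection` are hypothetical and do not reflect the toolkit's real configuration schema.

```python
from typing import Dict, List

# Task-to-model catalogue transcribed from the list above (keys are illustrative).
SUPPORTED_MODELS: Dict[str, List[str]] = {
    "layout_detection": ["DocLayout-YOLO", "YOLO-v10", "LayoutLMv3"],
    "formula_detection": ["YOLOv8"],
    "formula_recognition": ["UniMERNet"],
    "ocr": ["PaddleOCR"],
    "table_recognition": ["StructEqTable", "InternVL2-1B"],
}


def validate_selection(selection: Dict[str, str]) -> List[str]:
    """Return human-readable problems; an empty list means the selection is valid."""
    problems = []
    for task, model in selection.items():
        if task not in SUPPORTED_MODELS:
            problems.append(f"unknown task: {task}")
        elif model not in SUPPORTED_MODELS[task]:
            problems.append(f"{model} is not listed for {task}")
    return problems
```

Checking a selection up front, before any weights are loaded, gives a fast and readable failure when a task is paired with a model that was never listed for it.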

Maintenance & Community

Notable updates include the StructTable-InternVL2-1B table recognition model and the DocLayout-YOLO layout detection model, although the last commit was 8 months ago. Community contributions are welcome, and a Discord server is available for community engagement.

Licensing & Compatibility

Licensed under AGPL-3.0. The project depends on YOLO code and PyMuPDF, which are themselves AGPL-licensed, so the repository as a whole adopts that license. Derivative works and applications that link against the code must also be released under AGPL-3.0, which may restrict commercial use or integration with closed-source projects.

Limitations & Caveats

The project is focused on content extraction and does not include functionality for reconstructing extracted content into new document formats (e.g., PDF to Markdown); for this, the MinerU project is recommended. Future features like reading order sorting are listed as "Coming Soon!".

Health Check

  • Last Commit: 8 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 1
  • Issues (30d): 1

Star History

283 stars in the last 30 days
