pdf-document-layout-analysis by huridocs

Intelligent PDF document analysis and content extraction service

Created 1 year ago
727 stars

Top 47.4% on SourcePulse

Summary

This Docker-based microservice performs PDF layout analysis, OCR, and content extraction. It segments each document and classifies the elements (such as text, titles, images, and tables), determines reading order, and converts documents to formats such as Markdown and HTML, with integrated translation. It is aimed at users who need to automate complex PDF processing tasks.

How It Works

The service follows a Clean Architecture design for maintainability and testability. It offers two analysis models: the Vision Grid Transformer (VGT), which gives higher accuracy through visual layout understanding, and LightGBM models, which process documents faster using XML-based features from Poppler. Integrated Tesseract OCR supports over 150 languages. A RESTful API exposes analysis, extraction, format conversion, and OCR.
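As a rough illustration of how a client might pick between the two models over the REST API, the sketch below builds (but does not send) a multipart upload using only the Python standard library. The root endpoint, the form field names file and fast, and the helper name build_analysis_request are assumptions made for illustration; the project README is the authority on the actual request shape.

```python
import uuid
import urllib.request

BASE_URL = "http://localhost:5060"  # default address given in this summary


def build_analysis_request(pdf_bytes, filename, fast=False, base_url=BASE_URL):
    """Build a multipart/form-data POST carrying one PDF.

    ASSUMPTION: the service reads the PDF from a form field named "file" and
    switches to the faster LightGBM models when a field "fast" equals "true".
    """
    boundary = uuid.uuid4().hex
    parts = []
    if fast:
        # Optional plain-text field requesting the faster models (assumed name).
        parts += [
            f"--{boundary}".encode(),
            b'Content-Disposition: form-data; name="fast"',
            b"",
            b"true",
        ]
    # The file part, followed by the closing boundary.
    parts += [
        f"--{boundary}".encode(),
        f'Content-Disposition: form-data; name="file"; filename="{filename}"'.encode(),
        b"Content-Type: application/pdf",
        b"",
        pdf_bytes,
        f"--{boundary}--".encode(),
        b"",
    ]
    body = b"\r\n".join(parts)
    headers = {"Content-Type": f"multipart/form-data; boundary={boundary}"}
    return urllib.request.Request(base_url, data=body, headers=headers, method="POST")
```

Once the service is running, the request could be sent with urllib.request.urlopen(request). Building the request separately from sending it keeps the model choice easy to inspect and test without a live container.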

Quick Start & Requirements

To start the service, run make start (or make start_translation to enable translation features). The service is then accessible at http://localhost:5060. Prerequisites are Docker Desktop 4.25.0+ and, for development, Python 3.10+. The NVIDIA Container Toolkit is optional but recommended for GPU acceleration. System requirements are a minimum of 2 GB RAM, 10 GB of disk space, and optionally 5 GB of GPU memory. The README links to the project on GitHub, HuggingFace, and Docker Hub.
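After starting the container, it can be useful to confirm that something is listening on the expected port before submitting documents. The following is a minimal sketch using only the standard library; the helper names is_service_up and wait_for_service are hypothetical, and probing the root URL is an assumption rather than a documented health endpoint.

```python
import time
import urllib.error
import urllib.request

BASE_URL = "http://localhost:5060"  # address stated in this summary


def is_service_up(base_url=BASE_URL, timeout=2.0):
    """Return True if an HTTP server answers at base_url, whatever the status."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server replied with an error status, so it is reachable.
        return True
    except OSError:
        # Connection refused, DNS failure, or timeout.
        return False


def wait_for_service(base_url=BASE_URL, attempts=30, delay=1.0):
    """Poll until the service answers or the attempts run out."""
    for _ in range(attempts):
        if is_service_up(base_url):
            return True
        time.sleep(delay)
    return False
```

Treating any HTTP reply as "up" is a deliberate simplification: it only shows that the port is served, not that the models have finished loading.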

Highlighted Details

  • Advanced Layout Analysis: Accurately segments and classifies diverse PDF content types.
  • Dual Model Strategy: Offers VGT for superior accuracy and LightGBM for speed and efficiency.
  • Multi-Format Output: Exports results as JSON, Markdown, or HTML, including segmentation data and extracted images.
  • Integrated Translation: Leverages Ollama models for automatic document translation into multiple languages.
  • Extensive OCR: Supports 150+ languages via Tesseract, with options for specific language packs.
  • Specialized Extraction: Capable of extracting tables as HTML and mathematical formulas as LaTeX.
  • Comprehensive API: Provides over 10 RESTful endpoints for granular control.
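To make the JSON output more concrete, here is a small sketch of how a client might post-process segmentation results. The field names page_number, type, and text, and the sample values, are assumptions for illustration only and are not taken from real service output; the list order is assumed to reflect the reading order the service determines.

```python
from collections import defaultdict

# Illustrative data only, NOT real service output.
# ASSUMPTION: each segment is an object with "page_number", "type", and "text".
sample_segments = [
    {"page_number": 1, "type": "Title", "text": "Annual Report"},
    {"page_number": 1, "type": "Text", "text": "This report covers 2023."},
    {"page_number": 2, "type": "Table", "text": "Year Total"},
    {"page_number": 2, "type": "Text", "text": "Figures are provisional."},
]


def group_by_type(segments):
    """Collect segment texts under their classified type, preserving order."""
    grouped = defaultdict(list)
    for segment in segments:
        grouped[segment["type"]].append(segment["text"])
    return dict(grouped)


def page_text(segments, page_number):
    """Join the text of one page, relying on list order as reading order."""
    return "\n".join(
        s["text"] for s in segments if s["page_number"] == page_number
    )


print(group_by_type(sample_segments))
print(page_text(sample_segments, 2))
```

With a real response, the same helpers would be applied to the parsed JSON body, for example to pull out all titles or to rebuild plain text page by page.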

Maintenance & Community

The project is developed by HURIDOCS. Specific details regarding community channels (e.g., Discord, Slack), active contributors, or sponsorships are not detailed in the provided README.

Licensing & Compatibility

The specific open-source license is not explicitly stated in the provided README. Compatibility is enhanced by its Docker-based deployment, facilitating easier integration into various environments.

Limitations & Caveats

The quality of automatic translations depends on the chosen Ollama model; smaller models may yield suboptimal results. GPU support is optional, but the VGT model performs significantly better with it. As noted above, the README does not state the license terms for commercial use or redistribution.
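Because VGT depends heavily on a GPU, a client might fall back to the faster LightGBM path on machines without one. The sketch below shows one possible heuristic; the function name prefer_fast_mode is hypothetical, and looking for the nvidia-smi executable on the host is only a rough signal that does not prove the container itself can access the GPU.

```python
import shutil
from typing import Optional


def prefer_fast_mode(gpu_available: Optional[bool] = None) -> bool:
    """Return True when the faster LightGBM models look like the safer choice.

    ASSUMPTION: the presence of the nvidia-smi executable is used as a proxy
    for a usable NVIDIA GPU. Pass gpu_available explicitly to override it.
    """
    if gpu_available is None:
        gpu_available = shutil.which("nvidia-smi") is not None
    # Without a GPU, VGT is expected to be slow, so prefer the fast models.
    return not gpu_available
```

The result is only a default; accuracy-sensitive workloads may still justify running VGT on CPU and accepting the longer processing time.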

Health Check

  • Last Commit: 2 weeks ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 1
  • Issues (30d): 4

Star History

25 stars in the last 30 days
