opendataloader-pdf  by opendataloader-project

PDF data extraction for AI

Created 5 months ago
710 stars

Top 48.2% on SourcePulse

Project Summary

OpenDataLoader PDF is a high-performance, local-first tool designed to convert PDF documents into structured formats like JSON, Markdown, or HTML, specifically for AI applications such as LLMs, vector search, and RAG. It reconstructs document layout, including headings, lists, and tables, to facilitate efficient content chunking, indexing, and querying, while incorporating AI-safety features to filter potentially harmful embedded content. The project targets developers and researchers working with document processing pipelines who require robust, privacy-preserving, and efficient PDF data extraction.

How It Works

The project employs a fast, heuristic, rule-based inference engine to parse PDFs, avoiding the need for GPUs and enabling high-throughput local processing. It focuses on reconstructing the semantic structure of documents, including reading order, headings, lists, and tables, which is crucial for meaningful data chunking and retrieval in AI systems. This approach prioritizes privacy and security by running entirely on the user's machine and includes built-in AI-safety measures to mitigate risks from prompt-injection content within PDFs.
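To illustrate why reconstructed headings matter for chunking, here is a minimal sketch of a downstream step that splits Markdown output into one chunk per heading section. It is not part of opendataloader-pdf; the function name `chunk_markdown_by_heading` and the chunk dictionary shape are assumptions made for this example, and it relies only on the Python standard library.

```python
import re
from typing import Dict, List

# Matches ATX-style Markdown headings such as "# Title" or "### Sub-section".
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def _make_chunk(title: str, body_lines: List[str]) -> List[Dict[str, str]]:
    """Return a one-element list with the chunk, or an empty list if it is blank."""
    text = "\n".join(body_lines).strip()
    if title or text:
        return [{"title": title, "text": text}]
    return []


def chunk_markdown_by_heading(markdown: str) -> List[Dict[str, str]]:
    """Split Markdown into chunks, starting a new chunk at every heading.

    Hypothetical helper for illustration only. Text that appears before the
    first heading is kept in a chunk with an empty title. Limitation: a line
    starting with '#' inside a fenced code block is also treated as a heading.
    """
    chunks: List[Dict[str, str]] = []
    title = ""
    body_lines: List[str] = []

    for line in markdown.splitlines():
        match = HEADING_RE.match(line)
        if match:
            # Close the section collected so far, then start a new one.
            chunks.extend(_make_chunk(title, body_lines))
            title = match.group(2)
            body_lines = []
        else:
            body_lines.append(line)

    # Close the final section.
    chunks.extend(_make_chunk(title, body_lines))
    return chunks
```

Each chunk keeps its heading as a title, so a retrieval system can index the section text together with the context it came from. Without a reliable heading structure in the extracted Markdown, a splitter like this would fall back to arbitrary fixed-size windows.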

Quick Start & Requirements

  • Python: pip install -U opendataloader-pdf
    import opendataloader_pdf
    opendataloader_pdf.run(input_path="path/to/document.pdf", generate_markdown=True)
    
  • Node.js: npm install @opendataloader/pdf
    import { run } from '@opendataloader/pdf';
    await run('path/to/document.pdf', { generateMarkdown: true });
    
    Note: The Node.js package is a wrapper around a Java CLI and is not intended for browser-based frontends.
  • Java: Integrate via Maven dependency (org.opendataloader:opendataloader-pdf-core) or use the provided CLI JAR.
  • Prerequisites: Java 11 or higher (required for all platforms), Python 3.9+.
  • Docker: Available via ghcr.io/opendataloader-project/opendataloader-pdf-cli.
  • Setup: Install Java first, then Python or Node.js, depending on which package you use.
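Because Java 11 or higher is a prerequisite for every package, it can help to check the installed Java version before running a conversion. The sketch below is one possible pre-flight check using only the Python standard library; the helper names `parse_java_major` and `java_is_new_enough` are hypothetical and are not provided by opendataloader-pdf.

```python
import re
import shutil
import subprocess
from typing import Optional


def parse_java_major(version_output: str) -> Optional[int]:
    """Extract the major Java version from `java -version` output.

    Handles both the modern scheme ("11.0.20", "17") and the legacy
    scheme ("1.8.0_381", which means Java 8). Returns None if no
    version string is found.
    """
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if not match:
        return None
    major = int(match.group(1))
    if major == 1 and match.group(2) is not None:
        # Legacy numbering: the real major version is the second field.
        return int(match.group(2))
    return major


def java_is_new_enough(minimum: int = 11) -> bool:
    """Return True if a `java` executable on PATH reports at least `minimum`."""
    java = shutil.which("java")
    if java is None:
        return False
    # `java -version` writes its report to stderr, so inspect both streams.
    result = subprocess.run([java, "-version"], capture_output=True, text=True)
    major = parse_java_major(result.stderr + result.stdout)
    return major is not None and major >= minimum


if __name__ == "__main__":
    if java_is_new_enough():
        print("Java 11+ found")
    else:
        print("Java 11 or higher is required but was not found on PATH")
```

Running the check up front gives a clear message about the missing prerequisite instead of a less obvious failure later in the pipeline.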

Highlighted Details

  • Rich Output Formats: Generates JSON, Markdown, and HTML, with options to embed HTML in Markdown or include images.
  • Layout Reconstruction: Accurately identifies and structures headings, lists, tables, and reading order.
  • AI-Safety: Automatically filters prompt-injection content embedded within PDFs.
  • Annotated PDF Visualization: Option to generate PDFs with detected structures overlaid for visual inspection.
  • Local-First & Privacy: Processes all data locally, ensuring data privacy.
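The summary does not describe how the AI-safety filter works. As a loosely related illustration only, the sketch below shows one common class of heuristic for hidden prompt-injection text: dropping spans that a human reader could not see (off-page, extremely small, or nearly the same color as the background). The `TextSpan` structure, the thresholds, and both function names are assumptions made for this example, not the project's data model or algorithm.

```python
from dataclasses import dataclass
from typing import List, Tuple

Color = Tuple[float, float, float]  # RGB components in the range 0.0 to 1.0


@dataclass
class TextSpan:
    """Hypothetical text span; not opendataloader-pdf's actual data model."""
    text: str
    font_size: float
    color: Color
    background: Color
    on_page: bool = True  # False if the span lies outside the visible page area


def is_probably_hidden(span: TextSpan,
                       min_font_size: float = 2.0,
                       min_contrast: float = 0.05) -> bool:
    """Flag spans a human reader would not see (assumed thresholds)."""
    if not span.on_page:
        return True
    if span.font_size < min_font_size:
        return True
    # Largest per-channel difference between text and background color.
    contrast = max(abs(a - b) for a, b in zip(span.color, span.background))
    return contrast < min_contrast


def visible_text(spans: List[TextSpan]) -> str:
    """Join the text of all spans that pass the visibility heuristic."""
    return " ".join(s.text for s in spans if not is_probably_hidden(s))
```

A heuristic like this only covers text that is visually hidden; instructions written in plain sight would need a different kind of check.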

Maintenance & Community

The project encourages community involvement through GitHub Discussions for Q&A and GitHub Issues for bug reporting. Contribution guidelines are available in CONTRIBUTING.md. The project also outlines specific rules for using its branding and trademarks.

Licensing & Compatibility

This project is licensed under the Mozilla Public License 2.0 (MPL 2.0). The MPL 2.0 is a file-level (weak) copyleft license: the code can be used, modified, and combined with software under other licenses, but distributed modifications to MPL-licensed files must remain under the MPL 2.0.

Limitations & Caveats

The project does not yet support Optical Character Recognition (OCR) for scanned PDFs. Both OCR and advanced extraction for complex tables are planned for December. The project is under active development, with further performance enhancements also on the roadmap.

Health Check

  • Last Commit: 1 day ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 11
  • Issues (30d): 1
  • Star History: 49 stars in the last 30 days
