opendataloader-pdf by opendataloader-project

PDF data extraction for AI

Created 9 months ago

851 stars

Top 41.8% on SourcePulse

Project Summary

OpenDataLoader PDF is a high-performance, local-first tool designed to convert PDF documents into structured formats like JSON, Markdown, or HTML, specifically for AI applications such as LLMs, vector search, and RAG. It reconstructs document layout, including headings, lists, and tables, to facilitate efficient content chunking, indexing, and querying, while incorporating AI-safety features to filter potentially harmful embedded content. The project targets developers and researchers working with document processing pipelines who require robust, privacy-preserving, and efficient PDF data extraction.
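To illustrate why heading-aware output helps with chunking, here is a minimal Python sketch that splits a Markdown string into one chunk per heading section. It assumes ATX-style (`#`) headings in the Markdown; `chunk_by_headings` is a hypothetical helper for illustration, not part of opendataloader-pdf.

```python
import re

# Matches ATX headings such as "# Title" or "### Subsection".
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def chunk_by_headings(markdown):
    """Split Markdown text into chunks, one per heading section.

    Hypothetical helper for illustration; not part of opendataloader-pdf.
    Returns a list of dicts holding the heading trail and the section text.
    """
    chunks = []
    path = []   # stack of (level, title) for the current heading trail
    body = []   # lines collected for the current section

    def flush():
        text = "\n".join(body).strip()
        if text:
            chunks.append({"path": [title for _, title in path], "text": text})
        body.clear()

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # Drop headings at the same or a deeper level before pushing.
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, match.group(2)))
        else:
            body.append(line)
    flush()
    return chunks
```

Each chunk carries its full heading trail, so a retriever can show where a passage came from. This sketch does not skip `#` lines inside fenced code blocks; a real pipeline would need to handle that case.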

How It Works

The project employs a fast, heuristic, rule-based inference engine to parse PDFs, avoiding the need for GPUs and enabling high-throughput local processing. It focuses on reconstructing the semantic structure of documents, including reading order, headings, lists, and tables, which is crucial for meaningful data chunking and retrieval in AI systems. This approach prioritizes privacy and security by running entirely on the user's machine and includes built-in AI-safety measures to mitigate risks from prompt-injection content within PDFs.
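As a rough illustration of what rule-based layout reconstruction can look like, the sketch below orders text spans top to bottom and left to right, then labels a span as a heading when its font size is noticeably larger than the median. The `Span` type, the 1.2 ratio, and the single-column assumption are all illustrative; the source does not describe the project's actual rules.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Span:
    """A text fragment with position and font size (hypothetical type)."""
    text: str
    x: float          # left edge, in points
    y: float          # distance from the top of the page, in points
    font_size: float


def reconstruct(spans, heading_ratio=1.2):
    """Return (kind, text) pairs in reading order.

    Assumes a single-column page; the 1.2 ratio is an arbitrary threshold.
    """
    if not spans:
        return []
    body_size = median(s.font_size for s in spans)
    # Reading order for one column: top to bottom, then left to right.
    ordered = sorted(spans, key=lambda s: (s.y, s.x))
    result = []
    for span in ordered:
        is_heading = span.font_size >= body_size * heading_ratio
        result.append(("heading" if is_heading else "paragraph", span.text))
    return result
```

Rules of this kind need only arithmetic and sorting, which is why a heuristic engine can run on a CPU without a GPU.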

Quick Start & Requirements

Python: pip install -U opendataloader-pdf

```
import opendataloader_pdf
opendataloader_pdf.run(input_path="path/to/document.pdf", generate_markdown=True)
```

Node.js: npm install @opendataloader/pdf
```
import { run } from '@opendataloader/pdf';
await run('path/to/document.pdf', { generateMarkdown: true });
```
Note: The Node.js package is a wrapper around a Java CLI and is not intended for browser-based frontends.
Java: Integrate via Maven dependency (org.opendataloader:opendataloader-pdf-core) or use the provided CLI JAR.
Prerequisites: Java 11 or higher (required on all platforms); Python 3.9+ for the Python package.
Docker: Available via ghcr.io/opendataloader-project/opendataloader-pdf-cli.
Setup: Requires Java and Python/Node.js installations.
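Because Java 11 or higher is a hard prerequisite, a pipeline may want to verify the runtime before attempting a conversion. The following is a minimal sketch, assuming `java` is on the `PATH`; `parse_java_major` and `java_is_supported` are hypothetical helpers, not part of the package.

```python
import re
import shutil
import subprocess


def parse_java_major(version_output):
    """Extract the major version from `java -version` output.

    Handles the legacy scheme ("1.8.0_292" means Java 8) and the modern
    scheme ("11.0.2", "17"). Returns None if no version string is found.
    """
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if not match:
        return None
    major = int(match.group(1))
    if major == 1 and match.group(2) is not None:
        return int(match.group(2))
    return major


def java_is_supported(minimum=11):
    """Return True if a Java runtime of at least `minimum` is on the PATH."""
    if shutil.which("java") is None:
        return False
    # `java -version` writes its banner to stderr, not stdout.
    proc = subprocess.run(["java", "-version"], capture_output=True, text=True)
    major = parse_java_major(proc.stderr or proc.stdout)
    return major is not None and major >= minimum
```

Calling `java_is_supported()` at startup lets a script fail early with a clear message instead of surfacing an error from deep inside the conversion step.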

Highlighted Details

Rich Output Formats: Generates JSON, Markdown, and HTML, with options to embed HTML in Markdown or include images.
Layout Reconstruction: Accurately identifies and structures headings, lists, tables, and reading order.
AI-Safety: Automatically filters prompt-injection content embedded within PDFs.
Annotated PDF Visualization: Option to generate PDFs with detected structures overlaid for visual inspection.
Local-First & Privacy: Processes all data locally, ensuring data privacy.
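One common way to hide prompt-injection text in a PDF is to make it invisible to human readers, for example with a tiny font or with text drawn in the same color as the background. The sketch below drops such spans before they reach an LLM. The `TextSpan` fields and both thresholds are illustrative assumptions; the source does not say which rules the project applies.

```python
from dataclasses import dataclass


@dataclass
class TextSpan:
    """Text with rendering attributes (hypothetical type for illustration)."""
    text: str
    font_size: float                    # in points
    color: tuple                        # (r, g, b), each 0-255
    background: tuple = (255, 255, 255)


def is_visible(span, min_font_size=2.0, min_contrast=30):
    """Heuristic: text is visible if it is large enough and not
    nearly the same color as its background."""
    if span.font_size < min_font_size:
        return False
    # Sum of per-channel differences as a crude contrast measure.
    contrast = sum(abs(c - b) for c, b in zip(span.color, span.background))
    return contrast >= min_contrast


def filter_hidden_text(spans):
    """Keep only spans a human reader would plausibly see."""
    return [s for s in spans if is_visible(s)]
```

A filter like this only catches text that is visually hidden; instructions written in plain sight would need a different kind of check.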

Maintenance & Community

The project encourages community involvement through GitHub Discussions for Q&A and GitHub Issues for bug reporting. Contribution guidelines are available in CONTRIBUTING.md. The project also outlines specific rules for using its branding and trademarks.

Licensing & Compatibility

This project is licensed under the Mozilla Public License 2.0 (MPL 2.0), a file-level (weak) copyleft license. It permits use, modification, and combination with proprietary code in a larger work, but any distributed modifications to MPL-licensed files must themselves remain under the MPL 2.0.

Limitations & Caveats

The project does not currently support Optical Character Recognition (OCR) for scanned PDFs; OCR and advanced extraction for complex tables are both planned for December. The project is under active development, with further performance enhancements on the roadmap.

Health Check

Last Commit

22 hours ago

Responsiveness

Inactive

Star History

23 stars in the last 30 days