pdf_oxide by yfedoseev

High-performance PDF toolkit for diverse applications

Created 7 months ago

817 stars

Top 43.0% on SourcePulse

Project Summary

PDF Oxide is a PDF processing toolkit with a Rust core, available as a Rust crate, a Python package, a WASM module, and a CLI. It covers text and image extraction, Markdown conversion, and PDF manipulation, and targets developers and researchers who need efficient document processing for RAG/LLM pipelines, AI assistants, and large-scale data extraction. Its main selling points are speed and reliability relative to existing libraries (the README reports a 0.8 ms mean processing time per document and a 100% pass rate on its test corpus), together with a permissive license.

How It Works

The core PDF parsing and manipulation logic is written in Rust, which the project credits for its speed and memory efficiency. That core is exposed through native Rust APIs, Python bindings (built with maturin), and WebAssembly for browser and Node.js environments. A dedicated CLI tool and an MCP server for AI assistants are built on the same core, so every interface shares one parsing implementation.

Quick Start & Requirements

Python: pip install pdf_oxide. Supports Python 3.8–3.14. Wheels are available for Linux, macOS, and Windows.
Rust: Add pdf_oxide = "0.3" to Cargo.toml.
CLI: Install via Homebrew (brew install yfedoseev/tap/pdf-oxide) or Cargo (cargo install pdf_oxide_cli).
WASM: npm install pdf-oxide-wasm.
MCP Server: Install via Homebrew or Cargo (cargo install pdf_oxide_mcp). Configuration details for AI assistants like Claude and Cursor are provided.
Building from Source: Clone the repository, run cargo build --release, then run maturin develop to build the Python bindings.
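
Before installing the Python package, it can help to confirm that the interpreter falls inside the supported range stated above and to see whether the package is already importable. The following is a minimal sketch of such a check. It assumes the import name pdf_oxide matches the pip package name, and the helper names are hypothetical, not part of the library.

```python
import importlib.util
import sys

# Supported range quoted in the quick-start notes: Python 3.8 through 3.14.
MIN_SUPPORTED = (3, 8)
MAX_SUPPORTED = (3, 14)


def python_is_supported(version=None):
    """Return True if (major, minor) lies within the stated supported range."""
    major_minor = tuple((version or sys.version_info)[:2])
    return MIN_SUPPORTED <= major_minor <= MAX_SUPPORTED


def package_is_installed(module_name="pdf_oxide"):
    """Return True if the module can be located, without importing it.

    Assumption: the import name equals the pip package name.
    """
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    print("Python supported:", python_is_supported())
    print("pdf_oxide installed:", package_is_installed())
```

Using importlib.util.find_spec avoids loading the native extension just to test for its presence, which keeps the check cheap.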

Highlighted Details

Performance: Reports a mean processing time of 0.8 ms per document, stated as 5x faster than PyMuPDF and 15x faster than pypdf.
Reliability: Reports a 100% pass rate on a corpus of 3,830 real-world PDFs drawn from the veraPDF, Mozilla pdf.js, and DARPA SafeDocs test suites.
Text Quality: Reports 99.5% text parity with PyMuPDF and pypdfium2, and is said to extract text from substantially more of the difficult files.
Features: Supports text, image, form field, annotation, bookmark, and table extraction; PDF creation and editing; Markdown and HTML conversion; scoped, word, and line-level extraction.
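
To put the performance figures above in concrete terms, the short calculation below derives the per-document times that the stated speedups imply for PyMuPDF and pypdf, and scales each to the 3,830-document corpus. These are back-of-the-envelope estimates from the quoted ratios, not measured values. Applying the 0.8 ms mean to the reliability corpus is itself an assumption, and the function names are illustrative.

```python
# Figures quoted in the summary above.
MEAN_MS_PER_DOC = 0.8    # reported mean processing time for pdf_oxide
SPEEDUP_VS_PYMUPDF = 5   # "5x faster than PyMuPDF"
SPEEDUP_VS_PYPDF = 15    # "15x faster than pypdf"
CORPUS_SIZE = 3830       # PDFs in the reliability corpus


def implied_baseline_ms(mean_ms, speedup):
    """Mean time per document a library would need for the speedup to hold."""
    return mean_ms * speedup


def corpus_seconds(mean_ms, documents):
    """Total time for `documents` files at `mean_ms` each, in seconds."""
    return mean_ms * documents / 1000.0


if __name__ == "__main__":
    libraries = [
        ("pdf_oxide", 1),
        ("PyMuPDF", SPEEDUP_VS_PYMUPDF),
        ("pypdf", SPEEDUP_VS_PYPDF),
    ]
    for name, speedup in libraries:
        per_doc = implied_baseline_ms(MEAN_MS_PER_DOC, speedup)
        total = corpus_seconds(per_doc, CORPUS_SIZE)
        print(f"{name}: ~{per_doc:.1f} ms/doc, ~{total:.1f} s for {CORPUS_SIZE} PDFs")
```

Under these assumptions the whole corpus would take roughly 3.1 s with pdf_oxide, against about 15.3 s and 46.0 s for the two implied baselines.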

Maintenance & Community

The project is maintained by Yury Fedoseev, with the source code available on GitHub. No specific community channels (e.g., Discord, Slack) or sponsorship details are mentioned in the README.

Licensing & Compatibility

PDF Oxide is dual-licensed under MIT or Apache-2.0, allowing for free use in both commercial and open-source projects without the copyleft restrictions found in AGPL-licensed alternatives.

Limitations & Caveats

The library is at version 0.3.14, a pre-1.0 release that is still under active development. Although the README claims a 100% pass rate on valid PDFs, it also notes that certain intentionally broken test fixtures do not pass. No other limitations or unsupported platforms are documented.
