pdf_oxide  by yfedoseev

High-performance PDF toolkit for diverse applications

Created 4 months ago
419 stars

Top 70.3% on SourcePulse

Project Summary

PDF Oxide is a PDF processing toolkit with a Rust core, available as a Rust crate, a Python package, a WASM module, and a CLI. It covers text and image extraction, Markdown conversion, and PDF manipulation, and targets developers and researchers who need efficient document processing for RAG/LLM pipelines, AI assistants, and large-scale data extraction. Its stated advantages over existing libraries are faster processing and higher reliability (the README reports a 0.8 ms mean processing time per document and a 100% pass rate on its test corpus), together with a permissive license.

How It Works

A Rust core handles PDF parsing and manipulation, which the project credits for its speed and memory efficiency. The core is exposed through native Rust APIs, Python bindings (built with maturin), and WebAssembly for browser and Node.js environments. A dedicated CLI tool and an MCP server for AI assistants cover command-line and assistant-driven use.

Quick Start & Requirements

  • Python: pip install pdf_oxide. Supports Python 3.8–3.14. Wheels are available for Linux, macOS, and Windows.
  • Rust: Add pdf_oxide = "0.3" to Cargo.toml.
  • CLI: Install via Homebrew (brew install yfedoseev/tap/pdf-oxide) or Cargo (cargo install pdf_oxide_cli).
  • WASM: npm install pdf-oxide-wasm.
  • MCP Server: Install via Homebrew or Cargo (cargo install pdf_oxide_mcp). Configuration details for AI assistants like Claude and Cursor are provided.
  • Building from Source: Clone the repository, run cargo build --release, then run maturin develop to build the Python bindings.
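
Before installing the Python package, it can help to confirm that the interpreter falls in the supported range listed above. The following is a minimal sketch using only the standard library. The helper names are hypothetical, and the import name pdf_oxide is an assumption (taken to match the pip package name); neither is confirmed by the README.

```python
import importlib.util
import sys

# Supported range as stated in the list above: Python 3.8 through 3.14.
SUPPORTED_MIN = (3, 8)
SUPPORTED_MAX = (3, 14)


def python_supported(version=None):
    """Return True if the (major, minor) version falls in the stated range."""
    if version is None:
        version = sys.version_info
    return SUPPORTED_MIN <= tuple(version[:2]) <= SUPPORTED_MAX


def package_installed(module_name="pdf_oxide"):
    """Return True if the module can be located, without importing it.

    Assumption: the import name matches the pip package name.
    """
    return importlib.util.find_spec(module_name) is not None


print("Python version supported:", python_supported())
print("pdf_oxide importable:", package_installed())
```

If the second line prints False after pip install pdf_oxide, the likely causes are a different active environment or an import name that differs from the one assumed here.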

Highlighted Details

  • Performance: Reports a 0.8 ms mean processing time per document, which the README puts at 5x faster than PyMuPDF and 15x faster than pypdf.
  • Reliability: Reports a 100% pass rate on a corpus of 3,830 real-world PDFs that includes files from the veraPDF, Mozilla pdf.js, and DARPA SafeDocs test suites.
  • Text Quality: Reports 99.5% text parity with PyMuPDF and pypdfium2, and is said to extract text from more of the challenging files.
  • Features: Supports text, image, form field, annotation, bookmark, and table extraction; PDF creation and editing; Markdown and HTML conversion; scoped, word, and line-level extraction.
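
To put the performance figures in context, the short calculation below derives what the reported speedups would imply. It is a sketch under two assumptions: "Nx faster" is read as a ratio of mean per-document times, and the 0.8 ms mean is applied to the 3,830-file corpus. The derived numbers are illustrative and are not figures reported by the README.

```python
# Figures quoted in the summary above.
MEAN_MS = 0.8                              # reported mean time per document
SPEEDUPS = {"PyMuPDF": 5, "pypdf": 15}     # reported "Nx faster" factors
CORPUS_SIZE = 3830                         # PDFs in the reliability corpus


def implied_mean_ms(speedup, mean_ms=MEAN_MS):
    """Mean time the slower library would need, if 'Nx faster' is a time ratio."""
    return mean_ms * speedup


def corpus_seconds(mean_ms, documents=CORPUS_SIZE):
    """Total single-threaded time for the corpus at the given per-document mean."""
    return mean_ms * documents / 1000.0


for library, factor in SPEEDUPS.items():
    other = implied_mean_ms(factor)
    print(f"{library}: ~{other:.1f} ms/doc, ~{corpus_seconds(other):.1f} s for the corpus")
print(f"pdf_oxide: {MEAN_MS} ms/doc, ~{corpus_seconds(MEAN_MS):.1f} s for the corpus")
```

Under these assumptions the whole corpus would take roughly three seconds with pdf_oxide, against roughly 15 and 46 seconds for the other two libraries.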

Maintenance & Community

The project is maintained by Yury Fedoseev, with the source code available on GitHub. No specific community channels (e.g., Discord, Slack) or sponsorship details are mentioned in the README.

Licensing & Compatibility

PDF Oxide is dual-licensed under MIT or Apache-2.0, allowing for free use in both commercial and open-source projects without the copyleft restrictions found in AGPL-licensed alternatives.

Limitations & Caveats

The library is at version 0.3.14, a pre-1.0 release, so the API may still change between minor versions. The 100% pass rate applies to valid PDFs; the README notes that some intentionally broken test fixtures do not pass. No other explicit limitations or unsupported platforms are detailed.
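
The Cargo requirement pdf_oxide = "0.3" from the Quick Start section is a caret requirement, which for a 0.x crate accepts any 0.3.z release but not 0.4.0. The sketch below models that rule in simplified form to show that 0.3.14 satisfies it. The function names are hypothetical, and pre-release tags and other requirement operators are deliberately not handled.

```python
def caret_bounds(requirement):
    """Return (lower, upper) version tuples for a bare Cargo-style caret requirement.

    The upper bound bumps the first non-zero component that was given,
    or the last given component if all of them are zero.
    """
    given = [int(part) for part in requirement.split(".")]
    lower = tuple(given + [0] * (3 - len(given)))
    bump = next((i for i, part in enumerate(given) if part != 0), len(given) - 1)
    upper = [0, 0, 0]
    upper[:bump] = given[:bump]
    upper[bump] = given[bump] + 1
    return lower, tuple(upper)


def satisfies(version, requirement):
    """Return True if a full x.y.z version falls inside the caret range."""
    lower, upper = caret_bounds(requirement)
    candidate = tuple(int(part) for part in version.split("."))
    return lower <= candidate < upper


print(caret_bounds("0.3"))
print(satisfies("0.3.14", "0.3"))
print(satisfies("0.4.0", "0.3"))
```

In practice this means a project pinned to "0.3" picks up patch releases such as 0.3.14 automatically, while a future 0.4 release would require an explicit version bump.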

Health Check

  • Last Commit: 3 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 50
  • Issues (30d): 158
  • Star History: 396 stars in the last 30 days
