pdf-craft by oomol-lab

CLI tool for converting PDFs, especially scanned books, into other formats

created 5 months ago
3,123 stars

Top 15.7% on sourcepulse

Project Summary

This project provides tools for converting PDF files, particularly scanned books, into structured formats like Markdown and EPUB. It targets researchers, students, and anyone needing to extract and repurpose content from PDF documents, offering local processing for Markdown conversion and LLM-assisted structuring for EPUBs.

How It Works

PDF Craft combines several AI models: DocLayout-YOLO for document layout analysis, OnnxOCR for text recognition, and layoutreader for reading-order determination. For Markdown conversion, it processes pages locally, extracts text, and filters out headers and footers, with options to capture tables and formulas. For EPUB generation, it uses LLMs to build the book structure, incorporate the table of contents, and reformat citations, with the option to correct OCR errors.
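The Markdown path described above (layout detection, OCR, reading order, header/footer filtering) can be pictured with a minimal sketch. This is not pdf-craft's API: the `Block` type, its fields, and `blocks_to_markdown` are hypothetical names chosen for illustration, and the blocks are hand-written stand-ins for what the models would produce.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """Hypothetical stand-in for one detected page region (not a pdf-craft type)."""
    kind: str   # layout class, e.g. "title", "text", "header", "footer"
    order: int  # reading-order index, as a reading-order model might assign
    text: str   # text recognized by OCR for this region


def blocks_to_markdown(blocks: list[Block]) -> str:
    """Drop headers/footers, sort by reading order, and emit simple Markdown."""
    kept = [b for b in blocks if b.kind not in {"header", "footer"}]
    kept.sort(key=lambda b: b.order)
    parts = [f"# {b.text}" if b.kind == "title" else b.text for b in kept]
    return "\n\n".join(parts)


# Blocks arrive in detection order, which need not match reading order.
page = [
    Block("footer", 4, "Page 12"),
    Block("text", 3, "Second paragraph."),
    Block("header", 0, "My Book"),
    Block("title", 1, "Chapter 1"),
    Block("text", 2, "First paragraph."),
]

print(blocks_to_markdown(page))
```

The real tool delegates each stage to a model; the sketch only shows how the stage outputs fit together.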

Quick Start & Requirements

  • Install: pip install pdf-craft, then pip install onnxruntime==1.21.0 for CPU inference.
  • GPU Acceleration: Requires a CUDA environment; install onnxruntime-gpu==1.21.0 in place of the CPU package.
  • Python: 3.10+ (3.10.16 recommended).
  • LLM Configuration: For EPUB generation, an LLM service (e.g., DeepSeek) with API key and URL is required.
  • Formula/Table Extraction: Requires CUDA; certain rendering modes also require a LaTeX installation.
  • Docs: Introduction available in English and Chinese (中文).
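A quick way to confirm the requirements above before a first run is a small pre-flight check. The sketch below uses only the standard library. The import name `pdf_craft` is an assumption (the listing only gives the pip package name `pdf-craft`), and `check_environment` is a hypothetical helper, not part of the project.

```python
import sys
from importlib.util import find_spec

MIN_PYTHON = (3, 10)  # from the requirements list above


def _installed(module_name: str) -> bool:
    """True if a top-level module can be located without importing it."""
    return find_spec(module_name) is not None


def check_environment(version=sys.version_info[:2], is_installed=_installed) -> list[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if tuple(version) < MIN_PYTHON:
        problems.append(f"Python {version[0]}.{version[1]} found, 3.10+ required")
    # Assumed import name for the pdf-craft package.
    if not is_installed("pdf_craft"):
        problems.append("pdf-craft is not installed (pip install pdf-craft)")
    # onnxruntime and onnxruntime-gpu both provide the `onnxruntime` module.
    if not is_installed("onnxruntime"):
        problems.append("onnxruntime is not installed (pip install onnxruntime==1.21.0)")
    return problems


for problem in check_environment():
    print("WARNING:", problem)
```

The check does not verify the pinned 1.21.0 version or CUDA availability; it only reports what is missing.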

Highlighted Details

  • Local Markdown conversion: No remote LLM calls needed.
  • EPUB generation: LLM-assisted structuring, TOC creation, and citation handling.
  • Advanced OCR: Supports multiple OCR passes for improved quality.
  • Formula/Table Extraction: Can recognize and extract formulas (LaTeX) and tables (Markdown/HTML).
  • LLM Parameter Tuning: Supports temperature and top_p for controlling LLM output, with range-based retry mechanisms.
  • Analysis Request Splitting: Configurable window_tokens for managing LLM context.
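The last two items can be illustrated with a short sketch. It assumes that a range such as (low, high) for temperature or top_p is walked linearly from low to high across retries, and that window_tokens is a hard cap on the number of tokens per request; the listing does not spell out either rule, so both are assumptions. The function names `ramp` and `split_windows` are hypothetical and not part of pdf-craft.

```python
def ramp(value_range: tuple[float, float], attempt: int, max_attempts: int) -> float:
    """Linearly interpolate from low to high as the attempt index grows.

    attempt is 0-based; attempts beyond the last one stay at the upper bound.
    """
    low, high = value_range
    if max_attempts <= 1:
        return low
    step = (high - low) / (max_attempts - 1)
    return low + step * min(attempt, max_attempts - 1)


def split_windows(tokens: list[str], window_tokens: int) -> list[list[str]]:
    """Cut a token sequence into consecutive windows of at most window_tokens."""
    if window_tokens <= 0:
        raise ValueError("window_tokens must be positive")
    return [tokens[i:i + window_tokens] for i in range(0, len(tokens), window_tokens)]


# Sampling parameters for each of four attempts, under the linear-ramp assumption.
for attempt in range(4):
    temperature = ramp((0.3, 0.9), attempt, max_attempts=4)
    top_p = ramp((0.5, 1.0), attempt, max_attempts=4)
    print(f"attempt {attempt}: temperature={temperature:.2f}, top_p={top_p:.2f}")

# Whitespace "tokens" stand in for a real tokenizer here.
words = "a scanned book can be far longer than one context window".split()
for window in split_windows(words, window_tokens=4):
    print(window)
```

Starting at the low end keeps the first attempt close to deterministic, and raising the values only on failure gives later attempts a chance to escape a repeated bad output.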

Maintenance & Community

  • Dependencies: doc-page-extractor, DocLayout-YOLO, OnnxOCR, layoutreader, StructEqTable, LaTeX-OCR.
  • Community: Submit issues for problems or suggestions.

Licensing & Compatibility

  • License: Not explicitly stated in the README.
  • Compatibility: The README does not address commercial use or closed-source linking.

Limitations & Caveats

EPUB generation depends on an external LLM service, so it is not fully local. Some advanced features, such as formula rendering via MathML, may not work in all EPUB readers. SVG formula rendering requires a LaTeX installation.

Health Check

  • Last commit: 1 week ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 13
  • Issues (30d): 9
  • Star History: 650 stars in the last 90 days
