CLI tool for converting PDFs, especially scanned books, into other formats
This project provides tools for converting PDF files, particularly scanned books, into structured formats like Markdown and EPUB. It targets researchers, students, and anyone needing to extract and repurpose content from PDF documents, offering local processing for Markdown conversion and LLM-assisted structuring for EPUBs.
How It Works
PDF Craft combines three AI models: DocLayout-YOLO for document layout analysis, OnnxOCR for text recognition, and layoutreader for reading order determination. For Markdown conversion, it processes pages locally, extracting text and filtering out headers and footers, with options to capture tables and formulas. For EPUB generation, it uses LLMs to build the book structure, incorporate the table of contents, and reformat citations, and it can also correct OCR errors.
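The Markdown path described above can be pictured as a small post-processing step over the blocks that the three models produce. The following is only a minimal sketch under stated assumptions: the `Block` class, its field names, and the label strings are hypothetical and are not pdf-craft's actual data model or API.

```python
from dataclasses import dataclass


# Hypothetical stand-in for one detected region on a page. The field and
# label names are assumptions for illustration, not pdf-craft's data model.
@dataclass
class Block:
    label: str  # layout class assigned by the layout-analysis stage
    order: int  # position assigned by the reading-order stage
    text: str   # text produced by the OCR stage


# Labels treated as page furniture and dropped from the output (assumed names).
SKIPPED_LABELS = {"header", "footer"}


def page_to_markdown(blocks: list[Block]) -> str:
    """Drop headers/footers, restore reading order, and emit Markdown."""
    kept = [b for b in blocks if b.label not in SKIPPED_LABELS]
    kept.sort(key=lambda b: b.order)
    parts = []
    for b in kept:
        text = b.text.strip()
        if not text:
            continue  # skip regions where OCR found nothing
        parts.append(f"## {text}" if b.label == "title" else text)
    return "\n\n".join(parts)


page = [
    Block("footer", 3, "42"),
    Block("text", 2, "Scanned books rarely keep a clean text layer."),
    Block("header", 0, "Chapter 1"),
    Block("title", 1, "Introduction"),
]
markdown = page_to_markdown(page)
print(markdown)
```

Keeping the filtering and ordering separate from the model calls makes each stage easy to test on its own, which is the only design point this sketch is meant to show.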
Quick Start & Requirements
Install: pip install pdf-craft and pip install onnxruntime==1.21.0
GPU build of the runtime: onnxruntime-gpu==1.21.0
LaTeX installation for certain rendering modes.
Highlighted Details
temperature and top_p for controlling LLM output, with range-based retry mechanisms.
window_tokens for managing LLM context.
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
EPUB generation requires external LLM services, making it not fully local. Certain advanced features like formula rendering via MathML may not be compatible with all EPUB readers. LaTeX installation is a prerequisite for SVG formula rendering.
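Because SVG formula rendering depends on a LaTeX installation and MathML depends on the target reader, it can help to decide the rendering mode before a long conversion starts. The sketch below is an illustration only: the mode names (`"svg"`, `"mathml"`, `"source"`) and the `pick_formula_mode` helper are hypothetical and are not options taken from pdf-craft; the only real call is the standard-library `shutil.which`.

```python
import shutil


def pick_formula_mode(reader_supports_mathml: bool, latex_cmd: str = "latex") -> str:
    """Choose a formula rendering mode from what is available locally.

    The returned mode names are hypothetical labels for illustration.
    """
    # SVG rendering needs a LaTeX binary on PATH, so check for it first.
    if shutil.which(latex_cmd) is not None:
        return "svg"
    # Without LaTeX, MathML only makes sense if the target reader renders it.
    if reader_supports_mathml:
        return "mathml"
    # Last resort in this sketch: keep the formula source text as-is.
    return "source"


print(pick_formula_mode(reader_supports_mathml=True))
```

The preference order here (SVG first, then MathML) is an assumption that favours reader compatibility; a project with different priorities could reverse it.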