pdf-craft by oomol-lab

CLI tool for converting PDFs, especially scanned books, into other formats

Created 11 months ago

4,388 stars

Top 11.1% on SourcePulse

Project Summary

This project provides tools for converting PDF files, particularly scanned books, into structured formats like Markdown and EPUB. It targets researchers, students, and anyone needing to extract and repurpose content from PDF documents, offering local processing for Markdown conversion and LLM-assisted structuring for EPUBs.

How It Works

PDF Craft combines three AI models: DocLayout-YOLO for document layout analysis, OnnxOCR for text recognition, and layoutreader for reading-order determination. For Markdown conversion, it processes pages locally, extracts the text, and filters out headers and footers, with options to capture tables and formulas. For EPUB generation, it uses LLMs to build the book structure, incorporate the table of contents, and reformat citations, and it can also correct OCR errors.
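The Markdown path described above (label each block, drop headers and footers, emit the rest in reading order) can be illustrated with a small sketch. This is not pdf-craft's API: the Block class, its fields, and page_to_markdown are hypothetical names, and the layout labels and order values stand in for whatever the layout, OCR, and reading-order models actually return.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """Hypothetical stand-in for one detected region on a page."""
    kind: str   # layout label, e.g. "text", "header", "footer", "table"
    order: int  # position assigned by a reading-order model
    text: str   # text recognized by OCR


def page_to_markdown(blocks: list[Block], keep_tables: bool = False) -> str:
    """Drop headers/footers (and optionally tables), then join in reading order."""
    skipped = {"header", "footer"}
    if not keep_tables:
        skipped.add("table")
    kept = [b for b in blocks if b.kind not in skipped]
    kept.sort(key=lambda b: b.order)
    # Blank line between blocks so each becomes its own Markdown paragraph.
    return "\n\n".join(b.text for b in kept)


# Blocks arrive in detection order, not reading order.
page = [
    Block("footer", 3, "42"),
    Block("text", 2, "Second paragraph."),
    Block("header", 0, "Chapter 1"),
    Block("text", 1, "First paragraph."),
]
print(page_to_markdown(page))
```

Running it prints the two body paragraphs in reading order, with the running header and the page number removed.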

Quick Start & Requirements

Install: pip install pdf-craft, then pip install onnxruntime==1.21.0 (CPU).
GPU Acceleration: Requires a CUDA environment; install onnxruntime-gpu==1.21.0 in place of the CPU package.
Python: 3.10+ (3.10.16 recommended).
LLM Configuration: For EPUB generation, an LLM service (e.g., DeepSeek) with API key and URL is required.
Formula/Table Extraction: Requires CUDA, plus a LaTeX installation for certain rendering modes.
Docs: Introduction available in English and Chinese (中文).
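The requirements above can be checked before a first run. The following is a minimal sketch using only the Python standard library; the helper names (python_ok, installed_version, check_environment) are illustrative and not part of pdf-craft, and the sketch assumes the distribution names match the pip commands listed above.

```python
import sys
from importlib import metadata

REQUIRED_PYTHON = (3, 10)
PINNED_ORT = "1.21.0"
ORT_PACKAGES = ("onnxruntime", "onnxruntime-gpu")


def python_ok(version=None) -> bool:
    """True if the given (or running) interpreter is Python 3.10 or newer."""
    version = sys.version_info if version is None else version
    return tuple(version[:2]) >= REQUIRED_PYTHON


def installed_version(package: str):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_environment() -> list[str]:
    """Collect human-readable problems; an empty list means nothing was flagged."""
    problems = []
    if not python_ok():
        problems.append("Python 3.10 or newer is required")
    if installed_version("pdf-craft") is None:
        problems.append("pdf-craft is not installed")
    found = {name: installed_version(name) for name in ORT_PACKAGES}
    if all(v is None for v in found.values()):
        problems.append("neither onnxruntime nor onnxruntime-gpu is installed")
    for name, v in found.items():
        if v is not None and v != PINNED_ORT:
            problems.append(f"{name} is {v}, expected {PINNED_ORT}")
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print("warning:", problem)
```

The check only inspects installed package metadata; it does not verify that CUDA or LaTeX is usable.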

Highlighted Details

Local Markdown conversion: No remote LLM calls needed.
EPUB generation: LLM-assisted structuring, TOC creation, and citation handling.
Advanced OCR: Supports multiple OCR passes for improved quality.
Formula/Table Extraction: Can recognize and extract formulas (LaTeX) and tables (Markdown/HTML).
LLM Parameter Tuning: Supports temperature and top_p for controlling LLM output, with range-based retry mechanisms.
Analysis Request Splitting: Configurable window_tokens for managing LLM context.
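To make the last two items concrete, here is a sketch of one plausible reading of them: a sampling parameter given as a (low, high) range moves from low toward high as retries accumulate, and a long token sequence is cut into windows of at most window_tokens items. Both functions and their names are assumptions for illustration only; pdf-craft's actual retry schedule and splitting logic may differ.

```python
def param_for_attempt(value, attempt: int, max_retries: int) -> float:
    """Resolve a parameter such as temperature for a given retry attempt.

    `value` is either a single number (used as-is) or a (low, high) pair
    that is interpolated linearly: attempt 0 uses low, the last retry uses high.
    """
    if isinstance(value, (int, float)):
        return float(value)
    low, high = value
    if max_retries <= 0:
        return float(low)
    step = min(attempt, max_retries) / max_retries
    return low + (high - low) * step


def split_windows(tokens: list, window_tokens: int) -> list[list]:
    """Cut a token list into consecutive windows of at most window_tokens items."""
    if window_tokens <= 0:
        raise ValueError("window_tokens must be positive")
    return [
        tokens[i:i + window_tokens]
        for i in range(0, len(tokens), window_tokens)
    ]


# Temperature rises across three attempts (0, 1, 2) within the given range.
for attempt in range(3):
    print(attempt, param_for_attempt((0.25, 0.75), attempt, max_retries=2))

# Seven tokens with a window of three give windows of sizes 3, 3 and 1.
print(split_windows(list(range(7)), 3))
```

A real implementation would count tokens with the model's tokenizer and would likely split on paragraph or page boundaries instead of at fixed offsets.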

Maintenance & Community

Dependencies: doc-page-extractor, DocLayout-YOLO, OnnxOCR, layoutreader, StructEqTable, LaTeX-OCR.
Community: Submit issues for problems or suggestions.

Licensing & Compatibility

License: Not explicitly stated in the README.
Compatibility: Commercial use and closed-source linking compatibility are not detailed.

Limitations & Caveats

EPUB generation depends on an external LLM service, so only the Markdown path runs fully locally. Formulas rendered as MathML may not display correctly in every EPUB reader. Rendering formulas as SVG requires a LaTeX installation.


Explore Similar Projects

Versatile-OCR-Program by ses4255

docutranslate by xunbu

vision-parse by iamarunbrahma

DeepSeek-OCR-Web by fufankeji

paperless-gpt by icereed

markpdfdown by MarkPDFdown

pymupdf4llm by pymupdf

nlm-ingestor by nlmatics

llm_aided_ocr by Dicklesworthstone

PolyglotPDF by CBIhalsen

MinerU by opendatalab

markitdown by microsoft