pdf-craft by oomol-lab

CLI tool for converting PDFs, especially scanned books, into other formats

Created 7 months ago
3,220 stars

Top 15.0% on SourcePulse

Project Summary

This project provides tools for converting PDF files, particularly scanned books, into structured formats like Markdown and EPUB. It targets researchers, students, and anyone needing to extract and repurpose content from PDF documents, offering local processing for Markdown conversion and LLM-assisted structuring for EPUBs.

How It Works

The core of PDF Craft utilizes a combination of AI models for document layout analysis (DocLayout-YOLO), text recognition (OnnxOCR), and reading order determination (layoutreader). For Markdown conversion, it processes pages locally, extracting text and filtering out headers/footers, with options to capture tables and formulas. For EPUB generation, it leverages LLMs to build book structure, incorporate tables of contents, and reformat citations, with the ability to correct OCR errors.
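The Markdown path described above can be illustrated with a toy sketch. It is not pdf-craft's code: the `Block` class and `page_to_markdown` function are hypothetical names, the block labels are assumed to come from a layout model, and a plain top-to-bottom sort stands in for the reading order that layoutreader would predict.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """Hypothetical layout block: a label, its text, and its page position."""
    kind: str    # assumed labels: "header", "footer", "title", "text"
    text: str
    top: float   # distance from the top of the page
    left: float  # distance from the left edge


def page_to_markdown(blocks: list[Block]) -> str:
    # Drop running headers and footers, as labeled by the layout stage.
    body = [b for b in blocks if b.kind not in {"header", "footer"}]
    # Stand-in for reading-order prediction: top-to-bottom, then left-to-right.
    body.sort(key=lambda b: (b.top, b.left))
    # Render titles as Markdown headings and everything else as paragraphs.
    parts = [f"# {b.text}" if b.kind == "title" else b.text for b in body]
    return "\n\n".join(parts)
```

A page holding a header, a chapter title, one paragraph, and a page-number footer would come out as a heading followed by the paragraph, with the header and footer removed.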

Quick Start & Requirements

  • Install: pip install pdf-craft and pip install onnxruntime==1.21.0
  • GPU Acceleration: Requires a CUDA environment; install onnxruntime-gpu==1.21.0.
  • Python: 3.10+ (3.10.16 recommended).
  • LLM Configuration: For EPUB generation, an LLM service (e.g., DeepSeek) with API key and URL is required.
  • Formula/Table Extraction: Requires CUDA and a LaTeX installation for certain rendering modes.
  • Docs: English, 中文 Introduction
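
Before installing, it can help to confirm the requirements listed above. The following is a minimal sketch, not part of pdf-craft: `check_environment` is a hypothetical helper that checks the Python version and, if onnxruntime is already installed, asks it whether the CUDA execution provider is available.

```python
import importlib.util
import sys


def check_environment() -> dict:
    """Report whether the interpreter and onnxruntime look usable."""
    report = {
        "python_ok": sys.version_info >= (3, 10),
        "onnxruntime_installed": importlib.util.find_spec("onnxruntime") is not None,
        "cuda_provider": False,
    }
    if report["onnxruntime_installed"]:
        try:
            import onnxruntime

            # get_available_providers() lists the execution providers in this build.
            providers = onnxruntime.get_available_providers()
            report["cuda_provider"] = "CUDAExecutionProvider" in providers
        except Exception:
            # A broken install is treated the same as no GPU support.
            report["cuda_provider"] = False
    return report


if __name__ == "__main__":
    for name, value in check_environment().items():
        print(f"{name}: {value}")
```

If `cuda_provider` is false on a machine with a GPU, the CPU-only onnxruntime package is probably the one installed.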

Highlighted Details

  • Local Markdown conversion: No remote LLM calls needed.
  • EPUB generation: LLM-assisted structuring, TOC creation, and citation handling.
  • Advanced OCR: Supports multiple OCR passes for improved quality.
  • Formula/Table Extraction: Can recognize and extract formulas (LaTeX) and tables (Markdown/HTML).
  • LLM Parameter Tuning: Supports temperature and top_p for controlling LLM output, with range-based retry mechanisms.
  • Analysis Request Splitting: Configurable window_tokens for managing LLM context.
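
The last two items can be pictured with a small sketch. It rests on assumptions: the summary does not say how a range maps to retries, so linear interpolation from the low to the high end is used here as one plausible scheme, and a plain list of tokens stands in for real tokenizer output. `value_for_attempt` and `split_into_windows` are hypothetical names, not pdf-craft functions.

```python
def value_for_attempt(value_range: tuple[float, float], attempt: int, max_attempts: int) -> float:
    """Pick a temperature or top_p for a retry (assumed linear schedule)."""
    low, high = value_range
    if max_attempts <= 1:
        return low
    # attempt 0 uses the low end; the final attempt reaches the high end.
    step = min(attempt, max_attempts - 1) / (max_attempts - 1)
    return low + (high - low) * step


def split_into_windows(tokens: list, window_tokens: int) -> list[list]:
    """Cut a token sequence into consecutive windows of at most window_tokens."""
    if window_tokens <= 0:
        raise ValueError("window_tokens must be positive")
    return [tokens[i:i + window_tokens] for i in range(0, len(tokens), window_tokens)]
```

With a range of (0.3, 1.0) and three attempts, the first request uses the low end and the last uses the high end, so retries explore progressively more varied output. Each window from `split_into_windows` would be sent as a separate analysis request.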

Maintenance & Community

  • Dependencies: doc-page-extractor, DocLayout-YOLO, OnnxOCR, layoutreader, StructEqTable, LaTeX-OCR.
  • Community: Submit issues for problems or suggestions.

Licensing & Compatibility

  • License: Not explicitly stated in the README.
  • Compatibility: Commercial use and closed-source linking compatibility are not detailed.

Limitations & Caveats

EPUB generation depends on an external LLM service, so that path is not fully local. Formula rendering via MathML may not display correctly in all EPUB readers, and SVG formula rendering requires a LaTeX installation.

Health Check

  • Last Commit: 1 month ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 4
  • Star History: 56 stars in the last 30 days
