gptpdf by CosmosShadow

Python library for parsing PDFs into Markdown using GPT models

Created 1 year ago

3,559 stars

Top 13.6% on SourcePulse

Project Summary

This project provides a Python library that parses PDF documents into Markdown using large language models (LLMs), with a focus on preserving typography, math formulas, tables, and images. It is aimed at users who need to extract and structure complex information from PDFs automatically and at low cost.

How It Works

The core approach uses the PyMuPDF library to segment PDFs, identifying and marking non-textual elements. The marked segments are then passed to a multimodal LLM (such as GPT-4o), which generates the Markdown output. By relying on the model's visual understanding, the method aims for near-perfect preservation of document structure and content, including complex elements like tables and formulas.
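The segment-and-mark step described above might look roughly like the following sketch. It is not the project's actual code: the helper names (merge_rects, mark_pdf_pages), the 10-point merge gap, and the output file naming are all assumptions made for illustration. Only the PyMuPDF calls (fitz.open, get_drawings, get_images, get_image_rects, draw_rect, get_pixmap) are real library functions.

```python
import os


def _near(a, b, gap):
    """True if rectangles a and b (x0, y0, x1, y1) overlap or lie within `gap`."""
    return (a[0] - gap <= b[2] and b[0] - gap <= a[2]
            and a[1] - gap <= b[3] and b[1] - gap <= a[3])


def _union(a, b):
    """Smallest rectangle that contains both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def merge_rects(rects, gap=10.0):
    """Merge rectangles that overlap or sit within `gap` points of each other.

    Hypothetical helper: groups neighbouring image/drawing fragments so that
    one figure ends up inside one marked region.
    """
    merged = []
    for rect in rects:
        rect = tuple(rect)
        changed = True
        while changed:
            changed = False
            for other in merged:
                if _near(rect, other, gap):
                    merged.remove(other)
                    rect = _union(rect, other)
                    changed = True
                    break  # restart the scan with the enlarged rectangle
        merged.append(rect)
    return merged


def mark_pdf_pages(pdf_path, output_dir):
    """Outline non-text regions on each page and save the page as a PNG.

    Hypothetical function; requires PyMuPDF (pip install pymupdf).
    """
    import fitz  # PyMuPDF, imported lazily so the helpers above work without it

    os.makedirs(output_dir, exist_ok=True)
    image_paths = []
    with fitz.open(pdf_path) as doc:
        for index, page in enumerate(doc):
            # Collect bounding boxes of vector drawings and embedded images.
            rects = [tuple(d["rect"]) for d in page.get_drawings()]
            for img in page.get_images(full=True):
                rects += [tuple(r) for r in page.get_image_rects(img[0])]
            # Draw a red outline around each merged region.
            for rect in merge_rects(rects):
                page.draw_rect(fitz.Rect(rect), color=(1, 0, 0), width=1)
            # Render at 2x zoom so small text stays legible for the model.
            pix = page.get_pixmap(matrix=fitz.Matrix(2, 2))
            path = os.path.join(output_dir, f"page_{index}.png")
            pix.save(path)
            image_paths.append(path)
    return image_paths
```

The rendered page images would then be sent to the multimodal model together with a prompt asking for Markdown; that call is left out here because it depends on the provider.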

Quick Start & Requirements

Install via pip: pip install gptpdf
Requires an OpenAI API key (or compatible LLM endpoint).
Supports custom base URLs for alternative LLM providers (e.g., GLM-4V, Azure OpenAI).
A Google Colab notebook is available for quick testing: examples/gptpdf_Quick_Tour.ipynb
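A minimal usage sketch, assuming the parse_pdf entry point and its api_key, base_url, model, and output_dir parameters as shown in the project's README at the time of writing; check the current signature before relying on it. The environment variable names OPENAI_API_BASE and GPTPDF_MODEL and the helper names below are hypothetical, chosen only for this example.

```python
import os


def build_parse_kwargs(env=os.environ):
    """Collect keyword arguments for gptpdf.parse_pdf from environment variables.

    OPENAI_API_BASE and GPTPDF_MODEL are illustrative names, not part of gptpdf.
    """
    kwargs = {
        "api_key": env["OPENAI_API_KEY"],
        "model": env.get("GPTPDF_MODEL", "gpt-4o"),
    }
    base_url = env.get("OPENAI_API_BASE")
    if base_url:
        # Only set a custom endpoint when using an alternative provider.
        kwargs["base_url"] = base_url
    return kwargs


def pdf_to_markdown(pdf_path, output_dir="./output"):
    """Parse one PDF and return (markdown_text, extracted_image_paths)."""
    from gptpdf import parse_pdf  # imported lazily; requires `pip install gptpdf`

    content, image_paths = parse_pdf(
        pdf_path, output_dir=output_dir, **build_parse_kwargs()
    )
    return content, image_paths
```

Calling pdf_to_markdown("paper.pdf") sends requests to the configured LLM endpoint, so it incurs API cost and needs network access.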

Highlighted Details

Claims an average cost of $0.013 per page.
The core parsing logic is contained within 293 lines of code.
Supports custom prompts to fine-tune LLM behavior for specific document types or parsing requirements.
Can parse various complex elements including typography, math formulas, tables, pictures, and charts.
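To put the claimed per-page cost in context, a back-of-the-envelope estimate can be computed as follows. The $0.013 figure is the project's own claim; actual cost depends on the model, page content, and provider pricing. The function name is hypothetical.

```python
CLAIMED_COST_PER_PAGE_USD = 0.013  # average claimed by the project, not measured here


def estimate_cost(pages, cost_per_page=CLAIMED_COST_PER_PAGE_USD):
    """Return the estimated parsing cost in USD, rounded to cents."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    return round(pages * cost_per_page, 2)
```

At the claimed rate, a 100-page document comes to about $1.30 and a 1,000-page batch to about $13.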

Maintenance & Community

The project encourages community contributions via a WeChat group chat.

Licensing & Compatibility

The license is not explicitly stated in the README.

Limitations & Caveats

The project relies heavily on external LLM APIs, making its performance and cost directly dependent on those services.
Although the project aims for near-perfect parsing, accuracy on complex elements such as intricate tables or handwritten notes may vary with the LLM's capabilities and the quality of the input PDF.

Health Check

Last Commit: 8 months ago

Responsiveness: Inactive

Pull Requests (30d): 0

Issues (30d): 0

Star History

13 stars in the last 30 days

Explore Similar Projects

llmdocparser by lazyFrogLOL

SDK for parsing PDFs and analyzing content using LLMs

Created 1 year ago

Updated 1 year ago

noted.md by tejas-raskar

Handwritten notes to Markdown CLI

Created 6 months ago

Updated 5 months ago

Versatile-OCR-Program by ses4255

OCR pipeline for ML training datasets from documents

Created 9 months ago

Updated 7 months ago

docutranslate by xunbu

Translate documents locally with AI across multiple formats

Created 8 months ago

Updated 17 hours ago

Starred by Dan Guido (Cofounder of Trail of Bits).

llm-based-ocr by yigitkonur

Open-source OCR API leveraging LLMs for document text extraction

Created 1 year ago

Updated 1 month ago

vision-parse by iamarunbrahma

CLI tool for parsing PDFs into markdown using vision LLMs

Created 1 year ago

Updated 3 months ago

e2m by wisupai

Python library for converting files to Markdown, targeting RAG/model training

Created 1 year ago

Updated 1 year ago

Starred by John Resig (Author of jQuery; Chief Software Architect at Khan Academy), Jason Huggins (Creator of Selenium), and 2 more.

instructor-js by 567-labs

Typescript tool for structured extraction from LLMs

Created 2 years ago

Updated 11 months ago

co-op-translator by Azure

CLI tool for automating documentation translation using Azure AI

Created 1 year ago

Updated 2 days ago

Starred by Omar Sanseviero (DevRel at Google DeepMind).

ExtractThinker by enoch3712

Document intelligence library for LLMs, ORM-style interaction

Created 1 year ago

Updated 4 months ago

Starred by Rodrigo Nader (Cofounder of Langflow), Evan Conrad (Cofounder of SF Compute), and 2 more.

chunkr by lumina-ai-inc

Document intelligence API for RAG/LLM workflows

Created 1 year ago

Updated 3 months ago

PDFMathTranslate by PDFMathTranslate

CLI tool for PDF scientific paper translation, preserving format

Created 1 year ago

Updated 1 month ago
