gptpdf  by CosmosShadow

CLI tool for parsing PDFs into Markdown using GPT models

created 1 year ago
3,485 stars

Top 14.1% on sourcepulse

GitHubView on GitHub
Project Summary

This project provides a Python library for parsing PDF documents into Markdown format using large language models (LLMs), specifically targeting the preservation of typography, math formulas, tables, and images. It's designed for users who need to extract and structure complex information from PDFs, offering an automated and cost-effective solution.

How It Works

The core approach leverages the PyMuPDF library to segment PDFs, identifying and marking non-textual elements. These segmented PDFs are then processed by multimodal LLMs (like GPT-4o) to generate a Markdown output. This method aims for near-perfect preservation of document structure and content, including complex elements like tables and formulas, by utilizing the visual understanding capabilities of advanced LLMs.

Quick Start & Requirements

  • Install via pip: pip install gptpdf
  • Requires an OpenAI API key (or compatible LLM endpoint).
  • Supports custom base URLs for alternative LLM providers (e.g., GLM-4V, Azure OpenAI).
  • A Google Colab notebook is available for quick testing: examples/gptpdf_Quick_Tour.ipynb

Highlighted Details

  • Claims an average cost of $0.013 per page.
  • The core parsing logic is contained within 293 lines of code.
  • Supports custom prompts to fine-tune LLM behavior for specific document types or parsing requirements.
  • Can parse various complex elements including typography, math formulas, tables, pictures, and charts.

Maintenance & Community

  • The project encourages community contributions via WeChat group chat.

Licensing & Compatibility

  • The license is not explicitly stated in the README.

Limitations & Caveats

  • The project relies heavily on external LLM APIs, making its performance and cost directly dependent on those services.
  • While aiming for perfect parsing, the accuracy of complex elements like intricate tables or handwritten notes may vary based on the LLM's capabilities and the quality of the input PDF.
Health Check
Last commit

3 months ago

Responsiveness

1 week

Pull Requests (30d)
0
Issues (30d)
0
Star History
107 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen Chip Huyen(Author of AI Engineering, Designing Machine Learning Systems), Dan Guido Dan Guido(Cofounder of Trail of Bits), and
8 more.

markitdown by microsoft

0.9%
70k
Python tool for converting files to Markdown for LLM text analysis
created 8 months ago
updated 2 months ago
Feedback? Help us improve.