gptpdf  by CosmosShadow

CLI tool for parsing PDFs into Markdown using GPT models

Created 1 year ago
3,530 stars

Top 13.8% on SourcePulse

GitHubView on GitHub
Project Summary

This project provides a Python library for parsing PDF documents into Markdown format using large language models (LLMs), specifically targeting the preservation of typography, math formulas, tables, and images. It's designed for users who need to extract and structure complex information from PDFs, offering an automated and cost-effective solution.

How It Works

The core approach leverages the PyMuPDF library to segment PDFs, identifying and marking non-textual elements. These segmented PDFs are then processed by multimodal LLMs (like GPT-4o) to generate a Markdown output. This method aims for near-perfect preservation of document structure and content, including complex elements like tables and formulas, by utilizing the visual understanding capabilities of advanced LLMs.

Quick Start & Requirements

  • Install via pip: pip install gptpdf
  • Requires an OpenAI API key (or compatible LLM endpoint).
  • Supports custom base URLs for alternative LLM providers (e.g., GLM-4V, Azure OpenAI).
  • A Google Colab notebook is available for quick testing: examples/gptpdf_Quick_Tour.ipynb

Highlighted Details

  • Claims an average cost of $0.013 per page.
  • The core parsing logic is contained within 293 lines of code.
  • Supports custom prompts to fine-tune LLM behavior for specific document types or parsing requirements.
  • Can parse various complex elements including typography, math formulas, tables, pictures, and charts.

Maintenance & Community

  • The project encourages community contributions via WeChat group chat.

Licensing & Compatibility

  • The license is not explicitly stated in the README.

Limitations & Caveats

  • The project relies heavily on external LLM APIs, making its performance and cost directly dependent on those services.
  • While aiming for perfect parsing, the accuracy of complex elements like intricate tables or handwritten notes may vary based on the LLM's capabilities and the quality of the input PDF.
Health Check
Last Commit

5 months ago

Responsiveness

Inactive

Pull Requests (30d)
0
Issues (30d)
1
Star History
37 stars in the last 30 days

Explore Similar Projects

Starred by John Resig John Resig(Author of jQuery; Chief Software Architect at Khan Academy), Jason Huggins Jason Huggins(Creator of Selenium), and
2 more.

instructor-js by 567-labs

0.3%
753
Typescript tool for structured extraction from LLMs
Created 1 year ago
Updated 7 months ago
Starred by Chip Huyen Chip Huyen(Author of "AI Engineering", "Designing Machine Learning Systems"), Elvis Saravia Elvis Saravia(Founder of DAIR.AI), and
20 more.

markitdown by microsoft

6.7%
77k
Python tool for converting files to Markdown for LLM text analysis
Created 10 months ago
Updated 1 week ago
Feedback? Help us improve.