llmdocparser  by lazyFrogLOL

SDK for parsing PDFs and analyzing content using LLMs

created 1 year ago
272 stars

Top 95.5% on sourcepulse

Project Summary

This package parses PDF documents by identifying distinct content regions (text, figures, tables, etc.) using a layout analysis model. It then feeds images of these regions to multimodal LLMs like GPT-4o or Qwen-VL to extract structured text, making it suitable for RAG applications.

How It Works

A layout analysis model segments each PDF page into categorized regions: titles, text, figures, figure and table captions, tables, headers, footers, references, and equations. Each region is assigned coordinates and a reading order. Images of these regions are then passed to a multimodal LLM, which allows more precise content extraction than traditional text-only PDF parsers.
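
To make the region model concrete, here is a minimal sketch of how segmented regions might be represented and put into reading order before their images are sent to an LLM. The `Region` class, its field names, the category strings, and both helper functions are illustrative assumptions, not the package's actual data structures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """One layout region on a page (hypothetical structure, not the library's)."""
    page: int                                 # zero-based page index
    category: str                             # e.g. "title", "text", "table"
    bbox: tuple[float, float, float, float]   # (x0, y0, x1, y1) in page coordinates
    reading_order: int                        # position within the page


def in_reading_order(regions):
    """Sort regions by page, then by reading order within each page."""
    return sorted(regions, key=lambda r: (r.page, r.reading_order))


def crop_jobs(regions, skip=()):
    """Build one (page, bbox, category) crop job per region, in reading order.

    `skip` lets a caller drop categories such as "header" or "footer";
    whether to drop anything is the caller's choice, not a library default.
    """
    return [
        (r.page, r.bbox, r.category)
        for r in in_reading_order(regions)
        if r.category not in skip
    ]
```

Each crop job identifies the page area to render as an image; the LLM call itself is left out because it depends on the provider.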

Quick Start & Requirements

  • Install via pip: pip install llmdocparser
  • Requires Poetry for installation from source.
  • Usage involves providing a PDF path, output directory, and LLM configuration (e.g., Azure or OpenAI API keys, endpoint, deployment).
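
The inputs listed above can be gathered and checked before any parsing starts. The sketch below is not the llmdocparser API: `ParseJob`, `job_from_env`, the provider strings, and the `LLM_API_KEY` environment variable are all hypothetical names chosen for illustration.

```python
import os
from dataclasses import dataclass, field
from pathlib import Path

# Providers named in this summary; the exact identifier strings are assumed.
SUPPORTED_PROVIDERS = {"azure", "openai", "dashscope"}


@dataclass
class ParseJob:
    """Hypothetical bundle of the inputs a parsing run needs."""
    pdf_path: Path
    output_dir: Path
    provider: str
    api_key: str
    extra: dict = field(default_factory=dict)  # e.g. endpoint, deployment

    def validate(self):
        """Raise ValueError on an obviously unusable configuration."""
        if self.provider not in SUPPORTED_PROVIDERS:
            raise ValueError(f"unsupported provider: {self.provider!r}")
        if self.pdf_path.suffix.lower() != ".pdf":
            raise ValueError(f"not a PDF path: {self.pdf_path}")
        if not self.api_key:
            raise ValueError("missing API key")
        return self


def job_from_env(pdf_path, output_dir, provider, env=None, **extra):
    """Build a validated job, reading the key from LLM_API_KEY (assumed name)."""
    env = os.environ if env is None else env
    return ParseJob(
        pdf_path=Path(pdf_path),
        output_dir=Path(output_dir),
        provider=provider,
        api_key=env.get("LLM_API_KEY", ""),
        extra=dict(extra),
    ).validate()
```

Keeping credentials in the environment rather than in source files is a common choice; the package's own README should be consulted for the real function names and parameters.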

Highlighted Details

  • Identifies 10 distinct content region types within PDFs.
  • Supports multiple LLM providers: Azure, OpenAI, and DashScope.
  • Provides cost estimation for LLM processing, with an example showing $0.32 for a 15-page document using GPT-4o.
  • Extracts content into a structured format suitable for RAG pipelines.
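
The quoted cost example works out to roughly $0.021 per page ($0.32 / 15). As a rough planning aid, the sketch below extrapolates that figure linearly. The linear scaling is an assumption: actual cost depends on how many regions each page contains and on the model's pricing, and `estimate_cost` is a hypothetical helper, not the package's cost estimator.

```python
# Reference point taken from the example above: $0.32 for 15 pages with GPT-4o.
EXAMPLE_COST_USD = 0.32
EXAMPLE_PAGES = 15


def estimate_cost(pages, ref_cost=EXAMPLE_COST_USD, ref_pages=EXAMPLE_PAGES):
    """Linearly extrapolate LLM cost in USD from a reference document."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    if ref_pages <= 0:
        raise ValueError("ref_pages must be positive")
    return pages * ref_cost / ref_pages


if __name__ == "__main__":
    for n in (15, 100, 500):
        print(f"{n:>4} pages: about ${estimate_cost(n):.2f}")
```

Under this assumption a 100-page document would cost about $2.13.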

Licensing & Compatibility

  • The README does not explicitly state a license.

Limitations & Caveats

  • The effectiveness of both layout analysis and LLM extraction depends on the quality and complexity of the input PDFs.
  • Operation requires API keys and configuration for one of the supported LLM providers.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 7 stars in the last 90 days
