llmdocparser by lazyFrogLOL

SDK for parsing PDFs and analyzing content using LLMs

Created 1 year ago

270 stars

Top 95.4% on SourcePulse

Project Summary

This package parses PDF documents by identifying distinct content regions (text, figures, tables, etc.) using a layout analysis model. It then feeds images of these regions to multimodal LLMs like GPT-4o or Qwen-VL to extract structured text, making it suitable for RAG applications.

How It Works

A layout analysis model segments each PDF page into categorized regions, including titles, text, figures, captions, tables, headers, footers, references, and equations. Each region is assigned coordinates and a reading order. Images of the regions are then passed to a multimodal LLM, which the project presents as giving more precise extraction than traditional text-only PDF parsers.
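
The flow described above can be sketched in a few lines of Python. This is a minimal illustration, not the package's own code: the Region class, its field names, extract_document, and the fake_llm stand-in are all hypothetical, and a real pipeline would crop each bounding box from the page image and send it to a multimodal model.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Region:
    """One layout region on a page (hypothetical structure, not the package's type)."""
    page: int
    category: str                    # e.g. "title", "text", "table"
    bbox: Tuple[int, int, int, int]  # (x0, y0, x1, y1) in page pixels
    order: int                       # reading order within the page


def extract_document(
    regions: List[Region],
    describe_region: Callable[[Region], str],
    skip: Tuple[str, ...] = (),
) -> List[dict]:
    """Visit regions in reading order and collect one record per region."""
    ordered = sorted(regions, key=lambda r: (r.page, r.order))
    records = []
    for region in ordered:
        if region.category in skip:
            continue  # optional filter; skipping categories is an assumption of this sketch
        records.append({
            "page": region.page,
            "category": region.category,
            "bbox": region.bbox,
            "content": describe_region(region),
        })
    return records


def fake_llm(region: Region) -> str:
    # Stand-in for a multimodal LLM call on the cropped region image.
    return f"<{region.category} text from page {region.page}>"


regions = [
    Region(page=1, category="text", bbox=(50, 120, 550, 400), order=2),
    Region(page=1, category="title", bbox=(50, 40, 550, 100), order=1),
    Region(page=1, category="footer", bbox=(50, 760, 550, 790), order=3),
]
records = extract_document(regions, fake_llm, skip=("footer",))
for record in records:
    print(record["category"], record["content"])
```

Keeping the category and coordinates next to the extracted text means a downstream consumer can still tell a table from a paragraph and trace each record back to its place on the page.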

Quick Start & Requirements

Install via pip: pip install llmdocparser
Requires Poetry for installation from source.
Usage involves providing a PDF path, output directory, and LLM configuration (e.g., Azure or OpenAI API keys, endpoint, deployment).
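
As a rough sketch of gathering those inputs before calling the parser, the helper below collects the PDF path, output directory, and provider credentials into one dictionary and fails early if something is missing. Everything here is an assumption made for illustration: the function name build_parser_config, the key names, and the LLMDOC_ environment variable prefix are hypothetical, and the real parameter names should be taken from the project's README.

```python
import os
from typing import Dict, Mapping

# Hypothetical key names: check the project's README for the real parameter names.
REQUIRED_BY_PROVIDER = {
    "azure": ("api_key", "azure_endpoint", "azure_deployment"),
    "openai": ("api_key",),
}


def build_parser_config(
    provider: str,
    pdf_path: str,
    output_dir: str,
    env: Mapping[str, str] = os.environ,
) -> Dict[str, str]:
    """Gather the inputs listed above into one dict, failing early on gaps."""
    if provider not in REQUIRED_BY_PROVIDER:
        raise ValueError(f"unsupported provider: {provider!r}")
    config = {"llm_type": provider, "pdf_path": pdf_path, "output_dir": output_dir}
    missing = []
    for key in REQUIRED_BY_PROVIDER[provider]:
        # Read each credential from the environment so it stays out of source code.
        value = env.get(f"LLMDOC_{key.upper()}", "")
        if value:
            config[key] = value
        else:
            missing.append(key)
    if missing:
        raise ValueError(f"missing settings: {', '.join(missing)}")
    return config


# Placeholder values only; none of these are real credentials or endpoints.
demo_env = {
    "LLMDOC_API_KEY": "sk-placeholder",
    "LLMDOC_AZURE_ENDPOINT": "https://example.invalid",
    "LLMDOC_AZURE_DEPLOYMENT": "gpt-4o-demo",
}
config = build_parser_config("azure", "paper.pdf", "out/", env=demo_env)
print(sorted(config))
```

Validating the configuration up front avoids starting layout analysis on a long PDF only to fail at the first LLM call.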

Highlighted Details

Identifies 10 distinct content region types within PDFs.
Supports multiple LLM providers: Azure, OpenAI, and DashScope.
Provides cost estimation for LLM processing, with an example showing $0.32 for a 15-page document using GPT-4o.
Extracts content into a structured format suitable for RAG pipelines.
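
To put the quoted cost figure in perspective, the arithmetic below converts it to a per-page rate and extrapolates to a longer document. It assumes cost scales linearly with page count, which is only a rough guide since actual cost depends on how many regions each page contains and how much text they hold; the function name estimate_cost is hypothetical and is not the package's own cost estimator.

```python
# Figures quoted above: $0.32 for a 15-page document with GPT-4o.
EXAMPLE_COST_USD = 0.32
EXAMPLE_PAGES = 15


def estimate_cost(pages: int) -> float:
    """Scale the quoted example linearly by page count (a rough assumption)."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    return EXAMPLE_COST_USD / EXAMPLE_PAGES * pages


per_page = estimate_cost(1)
print(f"per page: ${per_page:.4f}")              # per page: $0.0213
print(f"100 pages: ${estimate_cost(100):.2f}")   # 100 pages: $2.13
```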

Maintenance & Community

Project is hosted on GitHub: https://github.com/lazyFrogLOL/llmdocparser
Star history is available.

Licensing & Compatibility

The README does not explicitly state a license.

Limitations & Caveats

The quality of both the layout analysis and the LLM extraction depends on the quality and complexity of the input PDFs.
Running the parser requires an API key and configuration for one of the supported LLM providers.

Health Check

Last Commit

1 year ago

Responsiveness

Inactive

Star History

0 stars in the last 30 days