llmsherpa by nlmatics

Developer APIs for LLM project acceleration

created 1 year ago
1,695 stars

Top 25.6% on sourcepulse

Project Summary

LLM Sherpa provides developer APIs to accelerate LLM projects by offering advanced PDF parsing capabilities. It addresses the challenge of extracting structured data and contextual information from PDFs, enabling developers to create more effective Retrieval Augmented Generation (RAG) systems. The library is designed for developers working with LLMs who need to process and understand PDF documents.

How It Works

LLM Sherpa's core component, LayoutPDFReader, parses PDFs to extract hierarchical layout information, including sections, paragraphs, tables, and lists, along with their relationships. This approach preserves document structure, unlike basic text extractors, allowing for smarter chunking of text that maintains context (e.g., associating table data with its surrounding section). This detailed parsing facilitates more accurate LLM interactions, especially for tasks requiring understanding of document flow and specific data elements.
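
To make the idea of context-preserving chunking concrete, here is a minimal sketch in plain Python. It is not LLM Sherpa's implementation: the `Block` class, the `section_path` helper, and the `context_chunks` function are hypothetical names invented for illustration. The sketch only shows how carrying the enclosing section titles along with each paragraph or table keeps a chunk tied to its place in the document.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class Block:
    """Hypothetical layout node: a document root, section, paragraph, or table."""
    kind: str
    text: str
    children: List["Block"] = field(default_factory=list)
    # Excluded from repr/compare to avoid parent <-> child recursion.
    parent: Optional["Block"] = field(default=None, repr=False, compare=False)

    def add(self, child: "Block") -> "Block":
        """Attach a child node and return it so calls can be chained."""
        child.parent = self
        self.children.append(child)
        return child

    def section_path(self) -> List[str]:
        """Titles of all enclosing sections, outermost first."""
        path = []
        node = self.parent
        while node is not None:
            if node.kind == "section":
                path.append(node.text)
            node = node.parent
        return list(reversed(path))


def context_chunks(root: Block) -> Iterator[str]:
    """Yield every non-section block prefixed with its section path."""
    for child in root.children:
        if child.kind == "section":
            yield from context_chunks(child)
        else:
            header = " > ".join(child.section_path())
            yield f"{header}\n{child.text}" if header else child.text


# Build a tiny two-level document by hand (illustrative data only).
doc = Block("document", "")
results = doc.add(Block("section", "3 Results"))
results.add(Block("paragraph", "Accuracy improved on all benchmarks."))
ablation = results.add(Block("section", "3.1 Ablation"))
ablation.add(Block("table", "model | score\nbase | 0.81"))

chunks = list(context_chunks(doc))
```

In this sketch the table chunk carries the header "3 Results > 3.1 Ablation", so a retriever that surfaces the table also surfaces the sections it belongs to.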

Quick Start & Requirements

  • Install: pip install llmsherpa
  • Prerequisites: Python 3.x. For advanced usage with LLMs, llama-index and an LLM API key (e.g., OpenAI) are recommended.
  • Demo: Google Colab
  • Docs: llmsherpa.readthedocs.io
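
After installation, a first parse might look like the sketch below. It assumes a self-hosted backend, since the public server is being retired. The host, the port 5010, and the `/api/parseDocument` path are assumptions modeled on the backend's documented defaults and may differ in your deployment. `build_api_url` and `read_chunks` are hypothetical helper names. The import is deferred into the function so the module loads even where llmsherpa is not installed.

```python
def build_api_url(host="localhost", port=5010):
    """Assumed URL of a self-hosted parsing service; adjust to your setup."""
    return f"http://{host}:{port}/api/parseDocument?renderFormat=all"


def read_chunks(pdf_path_or_url, api_url):
    """Parse a PDF and return each chunk's text with its section context.

    Requires `pip install llmsherpa` and a reachable backend at api_url.
    """
    # Imported lazily so this file can be loaded without the package.
    from llmsherpa.readers import LayoutPDFReader

    reader = LayoutPDFReader(api_url)
    doc = reader.read_pdf(pdf_path_or_url)
    return [chunk.to_context_text() for chunk in doc.chunks()]


api_url = build_api_url()
```

With a backend running, `read_chunks("paper.pdf", api_url)` would return a list of strings that could be passed to an embedding or indexing step; `paper.pdf` is a placeholder path.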

Highlighted Details

  • Preserves document structure: Extracts sections, paragraphs, tables, and lists with hierarchical context.
  • Smart chunking: Creates context-aware text chunks for improved LLM performance.
  • Table analysis: Enables LLM-based querying and summarization of tabular data.
  • Section extraction: Allows targeted LLM analysis of specific document sections.
  • Supports various file formats (DOCX, PPTX, HTML, TXT, XML) and includes OCR capabilities.
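
Section extraction and table analysis could be wrapped in small helpers such as the ones sketched here. The sketch assumes the parsed document exposes `sections()` and `tables()`, that sections have a `title` attribute, and that tables have a `to_text()` method, as in the project's README examples; verify these against the version you install. `find_section` and `tables_as_text` are hypothetical names, and the stand-in document at the end exists only so the helpers can be exercised without a backend.

```python
from types import SimpleNamespace


def find_section(doc, title_fragment):
    """Return the first section whose title contains title_fragment, else None."""
    needle = title_fragment.lower()
    for section in doc.sections():
        if needle in section.title.lower():
            return section
    return None


def tables_as_text(doc):
    """Collect the text rendering of every table in the document."""
    return [table.to_text() for table in doc.tables()]


# Stand-in object mimicking the assumed document interface (not real output).
stub_doc = SimpleNamespace(
    sections=lambda: [
        SimpleNamespace(title="1 Introduction"),
        SimpleNamespace(title="2.2 Fine-tuning BART"),
    ],
    tables=lambda: [SimpleNamespace(to_text=lambda: "model | score")],
)

match = find_section(stub_doc, "fine-tuning")
table_texts = tables_as_text(stub_doc)
```

The text of a matched section or table can then be placed in an LLM prompt for targeted summarization or question answering.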

Maintenance & Community

The backend service, nlm-ingestor, is open-sourced under Apache 2.0 in its own GitHub repository and can be self-hosted via Docker.

Licensing & Compatibility

The README does not state a license for the client library itself; the backend service is Apache 2.0. The README's examples use OpenAI for LLM integration, which suggests compatibility with commercial LLM providers.

Limitations & Caveats

LayoutPDFReader does not yet parse every PDF perfectly. According to the README's caveats, it currently supports only PDFs with a text layer (OCR is not supported). The free public API server is slated for decommissioning, so users will need to self-host the backend.

Health Check

  • Last commit: 9 months ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 0

Star History

61 stars in the last 90 days
