llmsherpa by nlmatics

Developer APIs for LLM project acceleration

Created 2 years ago

1,741 stars

Top 24.4% on SourcePulse

4 Experts Love This Project

Jeff Hammerbacher: Cofounder of Cloudera
Rodrigo Nader: Cofounder of Langflow
Xiaofan Luan: VP Engineering at Zilliz
Jerry Liu: Cofounder of LlamaIndex

Project Summary

LLM Sherpa provides developer APIs for layout-aware PDF parsing, aimed at speeding up LLM projects. It extracts structured data and contextual information from PDFs so that developers can build more effective Retrieval Augmented Generation (RAG) systems. The library targets developers working with LLMs who need to process and understand PDF documents.

How It Works

LLM Sherpa's core component, LayoutPDFReader, parses PDFs to extract hierarchical layout information: sections, paragraphs, tables, and lists, along with the relationships between them. Unlike basic text extractors, this approach preserves document structure, so text can be chunked in a way that keeps its context (e.g., a table stays associated with its surrounding section). The result is more accurate LLM interactions, especially for tasks that depend on document flow or on specific data elements.
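The idea of chunking that keeps section context can be sketched with the standard library alone. This is an illustrative model, not LLM Sherpa's implementation: the `Block` class, the `context_chunks` function, and the sample document are all hypothetical names introduced here. Each leaf block (paragraph or table) is emitted together with the titles of the sections that enclose it.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class Block:
    """Hypothetical layout node: a section heading, paragraph, or table."""
    kind: str  # "section", "para", or "table"
    text: str
    children: List["Block"] = field(default_factory=list)


def context_chunks(block: Block, ancestors: Tuple[str, ...] = ()) -> Iterator[str]:
    """Yield each leaf block's text prefixed with its enclosing section titles."""
    if block.kind == "section":
        # Descend, carrying this section's title as context for everything below.
        for child in block.children:
            yield from context_chunks(child, ancestors + (block.text,))
    else:
        heading_path = " > ".join(ancestors)
        yield f"{heading_path}\n{block.text}" if heading_path else block.text


# Sample document tree (made up for illustration).
doc = Block("section", "Annual Report", [
    Block("para", "Overview of the year."),
    Block("section", "Financials", [
        Block("table", "Revenue | 2023 | 2024"),
        Block("para", "Revenue grew year over year."),
    ]),
])

chunks = list(context_chunks(doc))
for chunk in chunks:
    print(chunk, end="\n\n")
```

A flat text extractor would return the table row with no indication that it belongs to "Financials"; here the heading path travels with every chunk, which is the property the paragraph above describes.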

Quick Start & Requirements

Install: pip install llmsherpa
Prerequisites: Python 3.x. For advanced usage with LLMs, llama-index and an LLM API key (e.g., OpenAI) are recommended.
Demo: Google Colab
Docs: llmsherpa.readthedocs.io
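
After installation, basic usage might look like the sketch below. It relies on `LayoutPDFReader`, `read_pdf`, `chunks`, and `to_context_text` from the llmsherpa package. The endpoint URL is an assumption for a locally self-hosted parser service, and `report.pdf` is a placeholder file name; replace both to match your setup.

```python
# Assumed endpoint of a self-hosted parser service; adjust host and port to your deployment.
PARSER_API_URL = "http://localhost:5010/api/parseDocument?renderFormat=all"


def read_context_chunks(pdf_path_or_url, api_url=PARSER_API_URL):
    """Parse a PDF through the parser service and return context-aware chunk texts."""
    # Imported lazily so this module loads even where llmsherpa is not installed.
    from llmsherpa.readers import LayoutPDFReader

    reader = LayoutPDFReader(api_url)
    doc = reader.read_pdf(pdf_path_or_url)
    # to_context_text() returns the chunk text together with its section context.
    return [chunk.to_context_text() for chunk in doc.chunks()]


# Example call (needs a running parser service and a PDF with a text layer):
# for text in read_context_chunks("report.pdf"):
#     print(text)
```

The returned strings can be passed to an embedding or indexing step of your choice, such as llama-index.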

Highlighted Details

Preserves document structure: Extracts sections, paragraphs, tables, and lists with hierarchical context.
Smart chunking: Creates context-aware text chunks for improved LLM performance.
Table analysis: Enables LLM-based querying and summarization of tabular data.
Section extraction: Allows targeted LLM analysis of specific document sections.
Additional formats: Supports DOCX, PPTX, HTML, TXT, and XML, and includes OCR capabilities.

Maintenance & Community

The backend service is open-sourced under Apache 2.0 and can be self-hosted via Docker. The project links to a GitHub repository for the backend service (nlm-ingestor).

Licensing & Compatibility

The library itself is not explicitly licensed in the README, but the backend service is Apache 2.0. The README mentions using OpenAI for LLM integration, implying compatibility with commercial LLM providers.

Limitations & Caveats

LayoutPDFReader does not yet parse every PDF correctly, and it currently supports only PDFs with a text layer (OCR is not supported). The free public API server will be decommissioned, so users will need to self-host the backend service.

Health Check

Last Commit: 1 year ago
Responsiveness: Inactive

Star History

4 stars in the last 30 days