nlm-ingestor by nlmatics

Server for LLM ingestion via API, enabling custom RAG parsing

Created 2 years ago

1,277 stars

Top 31.0% on SourcePulse

Project Summary

This repository provides the server-side code for the llmsherpa API, offering custom RAG-friendly parsers for various document formats. It's designed for developers building LLM applications who need to ingest and process diverse document types efficiently, particularly PDFs, with advanced structural understanding.

How It Works

The core of nlm-ingestor is a modified Apache Tika server. For PDFs, it employs a rule-based parser leveraging text coordinates, font data, and graphics information from a custom nlm-tika fork. This approach prioritizes speed and minimal hardware requirements over vision-based methods, offering features like section extraction, table identification, list parsing, and watermark removal. An optional OCR layer using Tesseract is available for scanned documents. Other formats like HTML, DOCX, and PPTX are parsed either directly or via Tika's HTML output, with a special HTML parser designed for layout-aware chunking.
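
To make the idea of a rule-based parser concrete, here is a toy sketch of how font data alone can separate headers, list items, and body text. It is not the nlm-ingestor algorithm: the Line record, the classify_lines function, and the size_ratio threshold are all hypothetical names and values chosen for illustration, and a real parser would also use coordinates and graphics information.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Line:
    """One extracted text line (hypothetical record, not an nlm-ingestor type)."""
    text: str
    x: float          # left edge of the line, in page units
    y: float          # baseline position, in page units
    font_size: float
    bold: bool = False


def classify_lines(lines, size_ratio=1.15):
    """Label each line as 'header', 'list_item', or 'para' using simple rules.

    Assumption: the median font size on the page is the body font size,
    and anything at least `size_ratio` times larger is a header.
    """
    if not lines:
        return []
    body_size = median(line.font_size for line in lines)
    labeled = []
    for line in lines:
        stripped = line.text.lstrip()
        if line.font_size >= body_size * size_ratio:
            label = "header"
        elif line.bold and len(stripped) < 80:
            # Short bold lines at body size are treated as headers too.
            label = "header"
        elif stripped.startswith(("-", "*", "\u2022")):
            label = "list_item"
        else:
            label = "para"
        labeled.append((label, line.text))
    return labeled
```

Rules like these need only the numbers the PDF already carries, which is why this style of parsing can run without a GPU or a vision model.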

Quick Start & Requirements

Installation: Start the modified Tika server from the provided JAR (java -jar <path_to_nlm-ingestor>/jars/tika-server-standard-nlm-modified-2.9.2_v2.jar), then install the ingestor (pip install nlm-ingestor). Alternatively, pull the Docker image (docker pull ghcr.io/nlmatics/nlm-ingestor:latest) and run it (docker run -p 5010:5001 ghcr.io/nlmatics/nlm-ingestor:latest-<version>), which maps the container's port 5001 to port 5010 on the host.
Prerequisites: Java (latest version recommended), nlm-tika (included as a JAR). OCR requires Tesseract.
API Endpoint: http://localhost:5010/api/parseDocument?renderFormat=all (with optional parameters like applyOcr=yes).
Resources: Minimal hardware needed unless using OCR.
Documentation: Notebooks for direct experimentation are available in the repository.
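
The endpoint above can be exercised from Python with only the standard library. The following is a minimal sketch, assuming the Docker container is running with the 5010:5001 port mapping, that the server accepts a multipart upload in a form field named "file", and that it replies with JSON. The field name, the JSON reply, and the helper names build_parse_url, build_multipart, and parse_document are assumptions for illustration, not confirmed by the summary above.

```python
import json
import os
import uuid
from urllib.parse import urlencode
from urllib.request import Request, urlopen

BASE_URL = "http://localhost:5010/api/parseDocument"


def build_parse_url(base_url=BASE_URL, render_format="all", apply_ocr=False):
    """Build the endpoint URL with the query parameters named in the text."""
    params = {"renderFormat": render_format}
    if apply_ocr:
        params["applyOcr"] = "yes"
    return f"{base_url}?{urlencode(params)}"


def build_multipart(filename, data, field_name="file",
                    content_type="application/pdf"):
    """Encode one file as a multipart/form-data body.

    Returns (body_bytes, content_type_header). The field name "file" is an
    assumption about what the server expects.
    """
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field_name}"; '
        f'filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def parse_document(path, **url_options):
    """POST a local file to the ingestor and return the decoded JSON reply."""
    with open(path, "rb") as fh:
        body, ctype = build_multipart(os.path.basename(path), fh.read())
    request = Request(
        build_parse_url(**url_options),
        data=body,
        headers={"Content-Type": ctype},
        method="POST",
    )
    with urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, a call such as parse_document("sample.pdf", apply_ocr=True) would send the file with OCR enabled; "sample.pdf" is a placeholder path.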

Highlighted Details

The rule-based PDF parser is described as roughly 100x faster than vision-based alternatives for text-heavy documents.
Supports OCR for scanned PDFs, with bounding box information.
Custom HTML parser creates layout-aware blocks for improved RAG performance.
Handles document structure like sections, tables, lists, and cross-page content.
Includes features for header/footer removal and watermark elimination.
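
As an illustration of the header/footer removal listed above, one common rule-based approach is to drop a page's first or last line when the same text recurs on most pages. The sketch below shows that general technique, not the nlm-ingestor implementation; the function name strip_repeated_edges and the min_fraction threshold are hypothetical.

```python
from collections import Counter


def strip_repeated_edges(pages, min_fraction=0.6):
    """Remove first/last lines that repeat across most pages.

    `pages` is a list of pages, each a list of text lines in top-to-bottom
    order. A line is treated as a running header or footer when it is the
    first or last line on at least `min_fraction` of the pages.
    """
    # With fewer than two pages there is nothing to compare against.
    if len(pages) < 2:
        return [list(page) for page in pages]

    counts = Counter()
    for page in pages:
        if page:
            # A set avoids double counting when a page has a single line.
            counts.update({page[0], page[-1]})

    threshold = min_fraction * len(pages)
    repeated = {text for text, n in counts.items() if n >= threshold}

    cleaned = []
    for page in pages:
        kept = list(page)
        if kept and kept[0] in repeated:
            kept = kept[1:]
        if kept and kept[-1] in repeated:
            kept = kept[:-1]
        cleaned.append(kept)
    return cleaned
```

Exact string matching misses footers that change per page, such as page numbers, so a fuller version would normalize digits or compare line positions as well.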

Maintenance & Community

The project highlights contributions from multiple individuals, with specific mentions of Ambika Sukla, Reshav Abraham, Tom Liu, and Kiran Panicker for core parsing components. The project relies on Apache Tika and PDFBox.

Licensing & Compatibility

The README does not explicitly state a license. Apache Tika and PDFBox are distributed under the Apache License 2.0. Compatibility with commercial use or closed-source linking is not specified.

Limitations & Caveats

The project relies on a modified version of Tika, and users may need to recompile the JAR if they encounter Java server errors with specific PDFs. Suggested future work includes decoupling the modifications from Tika and upgrading to newer Tika versions.

Health Check

Last Commit: 9 months ago
Responsiveness: Inactive

Star History

2 stars in the last 30 days