Server for LLM ingestion via API, enabling custom RAG parsing
This repository provides the server-side code for the llmsherpa API, offering custom RAG-friendly parsers for various document formats. It's designed for developers building LLM applications who need to ingest and process diverse document types efficiently, particularly PDFs, with advanced structural understanding.
How It Works
The core of nlm-ingestor is a modified Apache Tika server. For PDFs, it employs a rule-based parser leveraging text coordinates, font data, and graphics information from a custom nlm-tika fork. This approach prioritizes speed and minimal hardware requirements over vision-based methods, offering features like section extraction, table identification, list parsing, and watermark removal. An optional OCR layer using Tesseract is available for scanned documents. Other formats like HTML, DOCX, and PPTX are parsed either directly or via Tika's HTML output, with a special HTML parser designed for layout-aware chunking.
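To make the idea of rule-based parsing from font data concrete, here is a minimal sketch. It is not nlm-ingestor's actual rule set: the TextLine type, the bullet markers, and the 60-character threshold are all hypothetical, chosen only to illustrate how font size and weight can separate headings, list items, and body text without a vision model.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class TextLine:
    # Hypothetical record for one extracted line of text.
    text: str
    font_size: float
    bold: bool


# Assumed bullet markers; a real parser would recognize many more.
BULLETS = ("•", "-", "*")


def classify_lines(lines):
    """Label each line as 'heading', 'list_item', or 'paragraph'."""
    # Treat the most frequent font size on the page as the body-text size.
    body_size = Counter(line.font_size for line in lines).most_common(1)[0][0]
    labels = []
    for line in lines:
        stripped = line.text.lstrip()
        if stripped.startswith(BULLETS):
            labels.append("list_item")
        elif line.font_size > body_size or (line.bold and len(stripped) < 60):
            # Larger than body text, or short and bold: likely a heading.
            labels.append("heading")
        else:
            labels.append("paragraph")
    return labels


sample = [
    TextLine("1. Introduction", 14.0, True),
    TextLine("This paper describes a parser.", 10.0, False),
    TextLine("It uses layout rules.", 10.0, False),
    TextLine("• fast", 10.0, False),
    TextLine("• no GPU needed", 10.0, False),
]
labels = classify_lines(sample)
print(labels)
```

Rules like these run in a single pass over the extracted lines, which is consistent with the speed and low hardware requirements described above, though they are more brittle than vision-based methods on unusual layouts.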
Quick Start & Requirements
Start the modified Tika server (java -jar <path_to_nlm-ingestor>/jars/tika-server-standard-nlm-modified-2.9.2_v2.jar), then install the ingestor (pip install nlm-ingestor). Alternatively, use the provided Docker image (docker pull ghcr.io/nlmatics/nlm-ingestor:latest and docker run -p 5010:5001 ghcr.io/nlmatics/nlm-ingestor:latest-<version>).
Dependencies include nlm-tika (included as a JAR). OCR requires Tesseract.
With the Docker port mapping above, the parsing endpoint is http://localhost:5010/api/parseDocument?renderFormat=all (with optional parameters like applyOcr=yes).
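As a rough illustration of how a client might address this endpoint, the sketch below builds the request URL and a file-upload request using only the Python standard library. The URL, renderFormat, and applyOcr come from the text above; the use of a multipart POST with a form field named "file" is an assumption, and the helper names are hypothetical.

```python
import urllib.request
import uuid
from urllib.parse import urlencode

# Host port 5010 comes from the docker run mapping shown above.
BASE_URL = "http://localhost:5010/api/parseDocument"


def build_parse_url(base_url=BASE_URL, render_format="all", apply_ocr=False):
    """Return the endpoint URL with its query parameters."""
    params = {"renderFormat": render_format}
    if apply_ocr:
        params["applyOcr"] = "yes"
    return f"{base_url}?{urlencode(params)}"


def build_upload_request(url, filename, data):
    """Build (but do not send) a multipart/form-data POST carrying one file.

    Assumption: the server reads the document from a form field named "file".
    """
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/pdf\r\n\r\n"
    ).encode()
    tail = f"\r\n--{boundary}--\r\n".encode()
    headers = {"Content-Type": f"multipart/form-data; boundary={boundary}"}
    return urllib.request.Request(
        url, data=head + data + tail, headers=headers, method="POST"
    )


url = build_parse_url(apply_ocr=True)
request = build_upload_request(url, "example.pdf", b"%PDF-1.4 placeholder bytes")
print(url)
# Sending requires a running server:
# response = urllib.request.urlopen(request)
```

Keeping URL construction separate from the upload makes it easy to switch OCR on per document without touching the request code.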
Highlighted Details
Maintenance & Community
The project highlights contributions from multiple individuals, with specific mentions of Ambika Sukla, Reshav Abraham, Tom Liu, and Kiran Panicker for core parsing components. The project relies on Apache Tika and PDFBox.
Licensing & Compatibility
The README does not explicitly state a license. Apache Tika and PDFBox are typically distributed under Apache 2.0. Compatibility for commercial use or closed-source linking is not specified.
Limitations & Caveats
The project relies on a modified version of Tika, and users may need to recompile the JAR if encountering Java server errors with specific PDFs. Future work suggestions include making changes independent of Tika and upgrading to newer Tika versions.