nv-ingest by NVIDIA

Microservice SDK for parsing unstructured documents into retrieval system inputs

Created 1 year ago

2,799 stars

Top 16.9% on SourcePulse

1 Expert Loves This Project

hammer

Jeff Hammerbacher

Cofounder of Cloudera

Project Summary

NVIDIA Ingest is a microservice-based SDK for extracting text, metadata, and structured content from a wide range of enterprise documents, including PDFs, Office documents, images, and audio. It is designed for large-scale, performance-oriented data ingestion into retrieval systems, targeting developers and researchers building generative AI applications.

How It Works

NV-Ingest relies on NVIDIA NIMs (self-hosted microservices) for specialized extraction tasks. The pipeline parallelizes document splitting, classifies content (text, tables, charts, images), runs OCR, and emits output that follows a structured JSON schema. It can optionally compute embeddings and store them in Milvus. Multiple extraction methods are available per document type, so users can trade throughput against accuracy.
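
As a rough illustration of the flow described above, the following Python sketch takes a document that has already been split into pages, processes the pages in parallel, tags each element with a content type, and serializes the result as JSON. It is a toy model only: the field names (`document_type`, `page`, `content`), the prefix-based `classify` heuristic, and the function names are hypothetical and do not reflect NV-Ingest's actual schema or API.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from dataclasses import asdict, dataclass

# Content categories named in the summary above.
CONTENT_TYPES = ("text", "table", "chart", "image")


@dataclass
class Element:
    """One extracted element; field names are illustrative only."""
    document_type: str
    page: int
    content: str


def classify(raw: str) -> str:
    """Toy stand-in for a classifier: a '<type>:' prefix selects the type."""
    prefix, sep, _ = raw.partition(":")
    return prefix if sep and prefix in CONTENT_TYPES else "text"


def process_page(page_number: int, items: list[str]) -> list[Element]:
    """Classify every raw item found on a single page."""
    return [Element(classify(item), page_number, item) for item in items]


def run_pipeline(pages: list[list[str]]) -> str:
    """Process pages in parallel and return a JSON array of elements."""
    with ThreadPoolExecutor() as pool:
        # map() preserves input order, so the output stays in page order.
        per_page = pool.map(process_page, range(1, len(pages) + 1), pages)
        elements = [asdict(e) for page in per_page for e in page]
    return json.dumps(elements, indent=2)
```

Calling `run_pipeline([["table:a|b", "hello"], ["chart:bars"]])` yields a JSON array with one object per element, each carrying its page number and assigned type.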

Quick Start & Requirements

Install: Use Conda:
    conda create -y --name nvingest python=3.10 && \
    conda activate nvingest && \
    conda install -y -c rapidsai -c conda-forge -c nvidia nv_ingest=25.3.0 nv_ingest_client=25.3.0 nv_ingest_api=25.3.0 && \
    pip install opencv-python llama-index-embeddings-nvidia 'pymilvus==2.5.4' 'pymilvus[bulk_writer, model]' milvus-lite nvidia-riva-client unstructured-client
Prerequisites: Linux (Ubuntu 22.04+ recommended), Conda, Python 3.10, and the NVIDIA_BUILD_API_KEY and NVIDIA_API_KEY settings.
Setup: The quick start demonstrates library mode, which is intended for small workloads (fewer than 100 PDFs). For production deployments, Docker Compose or Kubernetes is recommended.
Docs: Official Documentation
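
The prerequisites listed above can be checked before installing anything. The sketch below is a minimal pre-flight check using only the Python standard library. The function name `missing_prerequisites` is hypothetical, and two things are assumptions: that NVIDIA_BUILD_API_KEY and NVIDIA_API_KEY are read as environment variables, and that the Python version must match 3.10 exactly.

```python
import os
import sys

# Names taken from the prerequisites above; treated here as environment variables.
REQUIRED_ENV_VARS = ("NVIDIA_BUILD_API_KEY", "NVIDIA_API_KEY")
# Assumption: an exact match on major.minor is required.
REQUIRED_PYTHON = (3, 10)


def missing_prerequisites(env=None, version=None) -> list[str]:
    """Return human-readable problems; an empty list means all checks passed."""
    env = os.environ if env is None else env
    version = sys.version_info[:2] if version is None else tuple(version[:2])

    problems = []
    if version != REQUIRED_PYTHON:
        problems.append(
            f"Python {REQUIRED_PYTHON[0]}.{REQUIRED_PYTHON[1]} expected, "
            f"found {version[0]}.{version[1]}"
        )
    for name in REQUIRED_ENV_VARS:
        # An empty string counts as unset.
        if not env.get(name):
            problems.append(f"environment variable {name} is not set")
    return problems
```

Running `missing_prerequisites()` with no arguments inspects the current interpreter and environment. Passing a dictionary and a version tuple makes the check easy to exercise in isolation.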

Highlighted Details

Supports PDF, DOCX, PPTX, JPEG, PNG, SVG, TIFF, TXT file types.
Offers multiple extraction methods for PDFs (pdfium, nemoretriever-parse, Unstructured.io, Adobe).
Integrates with Milvus for vector storage and LlamaIndex/LangChain for retrieval pipelines.
Includes example Python code for ingestion, embedding, Milvus upload, and querying.
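
To make the file-type and extraction-method lists above concrete, here is a small Python sketch that validates a file name against the supported types and, for PDFs, checks the requested extraction method. The helper `plan_extraction`, its return shape, and the choice of pdfium as the default are all hypothetical. The method names are copied from the list above, and the identifiers the library itself accepts may be spelled differently.

```python
from pathlib import Path

# File types and PDF extraction methods as listed above.
SUPPORTED_EXTENSIONS = {
    ".pdf", ".docx", ".pptx", ".jpeg", ".png", ".svg", ".tiff", ".txt",
}
PDF_METHODS = ("pdfium", "nemoretriever-parse", "Unstructured.io", "Adobe")


def plan_extraction(filename: str, pdf_method: str = "pdfium") -> dict:
    """Return a small plan dict, or raise ValueError for unsupported input."""
    suffix = Path(filename).suffix.lower()
    if suffix not in SUPPORTED_EXTENSIONS:
        raise ValueError(f"unsupported file type: {suffix or filename}")

    plan = {"file": filename, "type": suffix.lstrip(".")}
    if suffix == ".pdf":
        # Only PDFs offer a choice of extraction method in this sketch.
        if pdf_method not in PDF_METHODS:
            raise ValueError(f"unknown PDF extraction method: {pdf_method}")
        plan["method"] = pdf_method
    return plan
```

For example, `plan_extraction("report.pdf", "Adobe")` selects the Adobe method, while a file such as `movie.mkv` is rejected with a `ValueError`.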

Maintenance & Community

Developed by NVIDIA.
Contributions require commits to be signed off under the Developer Certificate of Origin (DCO).

Licensing & Compatibility

The project is released primarily under a permissive license. Third-party components (e.g., Adobe SDK, Llama tokenizer) may carry separate license terms that require review and, in some cases, access tokens.

Limitations & Caveats

This is an "early access" set of microservices.
GPU-accelerated indexing is not yet available in Milvus Lite.
Use of Adobe extraction requires enabling INSTALL_ADOBE_SDK and reviewing its license.
Use of Llama tokenizer requires setting HF_ACCESS_TOKEN and requesting access to gated models.
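
The two opt-in settings above can be inspected with a few lines of Python. This is a sketch under stated assumptions: it treats INSTALL_ADOBE_SDK and HF_ACCESS_TOKEN as environment variables, it assumes that values such as "1" or "true" mean "enabled", and the function name `optional_features` is hypothetical. It does not check license acceptance or gated-model access, both of which still need manual review.

```python
import os

# Assumption: these strings (case-insensitive) mean the flag is enabled.
TRUTHY = {"1", "true", "yes", "on"}


def optional_features(env=None) -> dict:
    """Report which optional components appear to be configured."""
    env = os.environ if env is None else env
    return {
        # Adobe extraction needs INSTALL_ADOBE_SDK to be enabled.
        "adobe_extraction": env.get("INSTALL_ADOBE_SDK", "").strip().lower() in TRUTHY,
        # The Llama tokenizer needs a non-empty HF_ACCESS_TOKEN.
        "llama_tokenizer": bool(env.get("HF_ACCESS_TOKEN", "").strip()),
    }
```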

Health Check

Last Commit: 1 day ago
Responsiveness: 1 day
Pull Requests (30d): 89
Issues (30d): 0

Star History

24 stars in the last 30 days

Explore Similar Projects

tiny-rag by wdndev

Tiny RAG system for retrieval-augmented LLM

Created 1 year ago

Updated 8 months ago

SmartResume by alibaba

AI-powered resume parsing system

Created 2 months ago

Updated 2 months ago

Versatile-OCR-Program by ses4255

OCR pipeline for ML training datasets from documents

Created 9 months ago

Updated 7 months ago

spacy-layout by explosion

spaCy plugin for structured PDF/document processing

Created 1 year ago

Updated 10 months ago

DeepSeek-OCR-WebUI by neosun100

Intelligent OCR web application for diverse document and image analysis

Created 2 months ago

Updated 3 weeks ago

Starred by Dharmesh Shah (Cofounder of HubSpot).

thepipe by emcf

SDK for extracting data from documents

Created 1 year ago

Updated 2 months ago

docstrange by NanoNets

Extract and convert data from any document to multiple formats

Created 5 months ago

Updated 2 months ago

Starred by Rodrigo Nader (Cofounder of Langflow).

ColiVara by tjmlabs

Document retrieval API using visual embeddings for enhanced RAG

Created 1 year ago

Updated 8 months ago

Starred by Jeremy Howard (Cofounder of fast.ai) and Tim Suchanek (Founder of expand.ai).

nlm-ingestor by nlmatics

Server for LLM ingestion via API, enabling custom RAG parsing

Created 2 years ago

Updated 9 months ago

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems").

WeKnora by Tencent

LLM framework for deep document understanding and RAG

Created 5 months ago

Updated 2 days ago

Starred by Jeffrey Morgan (Cofounder of Ollama), Dan Guido (Cofounder of Trail of Bits), and 2 more.

langextract by google

Extract structured data from text with LLMs

Created 6 months ago

Updated 1 week ago

Starred by Tobi Lutke (Cofounder of Shopify), Rodrigo Nader (Cofounder of Langflow), and 9 more.

ragflow by infiniflow

Open-source RAG engine for deep document understanding

Created 2 years ago

Updated 1 day ago
