nv-ingest by NVIDIA

Microservice SDK for parsing unstructured documents into retrieval system inputs

Created 1 year ago
2,745 stars

Top 17.3% on SourcePulse

Project Summary

NVIDIA Ingest is a microservice-based SDK for extracting text, metadata, and structured content from a wide range of enterprise documents, including PDFs, Office documents, images, and audio. It is designed for large-scale, performance-oriented data ingestion into retrieval systems, targeting developers and researchers building generative AI applications.

How It Works

NV-Ingest uses NVIDIA NIMs (self-hosted microservices) for specialized extraction tasks. It splits documents in parallel, classifies content (text, tables, charts, images), applies OCR, and emits results in a structured JSON schema. The pipeline can optionally compute embeddings and store them in Milvus. Because several extraction methods are available per document type, users can trade throughput against accuracy.
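
The flow described above can be pictured with a small, self-contained sketch. It is not NV-Ingest code and calls no NV-Ingest API: the `Element` record, the `split_pages`, `process_chunk`, and `ingest` helpers, and the JSON field names are all hypothetical. They only illustrate splitting a document into chunks, processing the chunks in parallel, tagging each extracted item with a content type, and emitting JSON.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, asdict

# Content categories named in the text; the string labels are illustrative.
CONTENT_TYPES = ("text", "table", "chart", "image")


@dataclass
class Element:
    """Hypothetical output record; not the actual NV-Ingest schema."""
    source: str
    page: int
    content_type: str
    content: str


def split_pages(pages, chunk_size):
    """Split a list of pages into consecutive chunks of at most chunk_size."""
    return [pages[i:i + chunk_size] for i in range(0, len(pages), chunk_size)]


def process_chunk(source, chunk):
    """Turn every (content_type, content) item on each page into an Element."""
    elements = []
    for page_number, items in chunk:
        for content_type, content in items:
            if content_type not in CONTENT_TYPES:
                raise ValueError(f"unknown content type: {content_type}")
            elements.append(Element(source, page_number, content_type, content))
    return elements


def ingest(source, pages, chunk_size=2, workers=4):
    """Process chunks in parallel and return one JSON string in page order."""
    chunks = split_pages(pages, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in input order, so page order is preserved.
        results = pool.map(lambda chunk: process_chunk(source, chunk), chunks)
        elements = [element for result in results for element in result]
    return json.dumps([asdict(element) for element in elements], indent=2)


# Toy input: pages whose items are already labeled with a content type.
SAMPLE_PAGES = [
    (1, [("text", "Quarterly report"), ("table", "region,revenue")]),
    (2, [("chart", "revenue by quarter")]),
    (3, [("image", "company logo")]),
]

if __name__ == "__main__":
    print(ingest("report.pdf", SAMPLE_PAGES))
```

In this toy version the content type is supplied with the input; in a real pipeline the classification and OCR steps would be performed by the extraction services rather than by the caller.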

Quick Start & Requirements

  • Install: Use Conda, running these commands in order:
      conda create -y --name nvingest python=3.10
      conda activate nvingest
      conda install -y -c rapidsai -c conda-forge -c nvidia nv_ingest=25.3.0 nv_ingest_client=25.3.0 nv_ingest_api=25.3.0
      pip install opencv-python llama-index-embeddings-nvidia 'pymilvus==2.5.4' 'pymilvus[bulk_writer, model]' milvus-lite nvidia-riva-client unstructured-client
  • Prerequisites: Linux (Ubuntu 22.04+ recommended), Conda, Python 3.10, NVIDIA_BUILD_API_KEY, and NVIDIA_API_KEY.
  • Setup: The quick start demonstrates library mode, which is intended for small workloads (fewer than 100 PDFs). For production deployments, Docker Compose or Kubernetes is recommended.
  • Docs: Official Documentation
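
Before installing, the prerequisites above can be verified with a short pre-flight script. This is a sketch and not part of NV-Ingest: the `check_prerequisites` helper is hypothetical, and it assumes that `NVIDIA_BUILD_API_KEY` and `NVIDIA_API_KEY` are supplied as environment variables. It does not check for Conda or for a specific Ubuntu release.

```python
import os
import platform
import sys

# Names taken from the prerequisites listed above; treating them as
# environment variables is an assumption of this sketch.
REQUIRED_ENV_VARS = ("NVIDIA_BUILD_API_KEY", "NVIDIA_API_KEY")
REQUIRED_PYTHON = (3, 10)


def check_prerequisites(env, python_version, system):
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    if system != "Linux":
        problems.append(f"expected Linux, found {system}")
    if tuple(python_version[:2]) != REQUIRED_PYTHON:
        found = ".".join(str(part) for part in python_version[:2])
        problems.append(f"expected Python 3.10, found {found}")
    for name in REQUIRED_ENV_VARS:
        if not env.get(name):
            problems.append(f"environment variable {name} is not set")
    return problems


if __name__ == "__main__":
    issues = check_prerequisites(os.environ, sys.version_info, platform.system())
    if issues:
        for issue in issues:
            print("PROBLEM:", issue)
    else:
        print("All prerequisite checks passed.")
```

Passing the environment, version, and platform as arguments keeps the function easy to exercise with made-up values, independent of the machine it runs on.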

Highlighted Details

  • Supports PDF, DOCX, PPTX, JPEG, PNG, SVG, TIFF, TXT file types.
  • Offers multiple extraction methods for PDFs (pdfium, nemoretriever-parse, Unstructured.io, Adobe).
  • Integrates with Milvus for vector storage and LlamaIndex/LangChain for retrieval pipelines.
  • Includes example Python code for ingestion, embedding, Milvus upload, and querying.
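
When preparing a batch for ingestion, it can help to sort files against the supported types listed above before submitting anything. The following is a minimal sketch, not NV-Ingest code: the `group_by_type` helper and the extension-to-type mapping (for example, treating both `.jpg` and `.jpeg` as JPEG) are assumptions made for illustration.

```python
from collections import defaultdict
from pathlib import PurePath

# File types from the list above, keyed by the extensions assumed for each.
SUPPORTED_EXTENSIONS = {
    ".pdf": "PDF",
    ".docx": "DOCX",
    ".pptx": "PPTX",
    ".jpg": "JPEG",
    ".jpeg": "JPEG",
    ".png": "PNG",
    ".svg": "SVG",
    ".tif": "TIFF",
    ".tiff": "TIFF",
    ".txt": "TXT",
}


def group_by_type(paths):
    """Group paths by supported file type; unsupported paths go under None."""
    groups = defaultdict(list)
    for path in paths:
        extension = PurePath(path).suffix.lower()
        groups[SUPPORTED_EXTENSIONS.get(extension)].append(path)
    return dict(groups)


if __name__ == "__main__":
    batch = ["a.pdf", "slides.PPTX", "scan.tif", "notes.md"]
    for file_type, members in group_by_type(batch).items():
        print(file_type, members)
```

Files grouped under `None` have an extension outside the listed types and would need conversion or separate handling.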

Maintenance & Community

  • Developed by NVIDIA.
  • Contribution requires signing off commits via Developer Certificate of Origin (DCO).

Licensing & Compatibility

  • The project is primarily under a permissive license. Third-party components (e.g., Adobe SDK, Llama tokenizer) may carry separate license terms that require review and, in some cases, access tokens.

Limitations & Caveats

  • This is an "early access" set of microservices.
  • GPU-accelerated indexing is not yet available in Milvus Lite.
  • Use of Adobe extraction requires enabling INSTALL_ADOBE_SDK and reviewing its license.
  • Use of Llama tokenizer requires setting HF_ACCESS_TOKEN and requesting access to gated models.

Health Check

  • Last commit: 22 hours ago
  • Responsiveness: 1 day
  • Pull requests (30d): 60
  • Issues (30d): 1
  • Star history: 16 stars in the last 30 days
