nv-ingest by NVIDIA

Microservice SDK for parsing unstructured documents into retrieval system inputs

created 11 months ago
2,726 stars

Top 17.8% on sourcepulse

Project Summary

NVIDIA Ingest is a microservice-based SDK for extracting text, metadata, and structured content from a wide range of enterprise documents, including PDFs, Office documents, images, and audio. It is designed for large-scale, performance-oriented data ingestion into retrieval systems, targeting developers and researchers building generative AI applications.

How It Works

NV-Ingest uses NVIDIA NIMs (self-hosted microservices) for specialized extraction tasks. It supports parallel document splitting, content classification (text, tables, charts, images), and OCR, and outputs JSON that follows a structured schema. The pipeline can optionally compute embeddings and store them in Milvus. Multiple extraction methods are available per document type, so users can trade throughput against accuracy.
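
To make the classification step concrete, here is a minimal sketch of grouping extracted elements by content type and summarizing them as JSON. The element records and their field names ("type", "page", "content") are hypothetical stand-ins chosen for illustration; they are not NV-Ingest's actual output schema, which is not shown on this page.

```python
import json
from collections import defaultdict

# Hypothetical extracted elements. The field names are illustrative only
# and do not reflect NV-Ingest's real JSON schema.
elements = [
    {"type": "text", "page": 1, "content": "Quarterly revenue grew."},
    {"type": "table", "page": 2, "content": "| Q1 | Q2 |"},
    {"type": "chart", "page": 2, "content": "Revenue by quarter"},
    {"type": "text", "page": 3, "content": "Outlook remains stable."},
]


def group_by_type(items):
    """Group extracted elements by their content type."""
    grouped = defaultdict(list)
    for item in items:
        grouped[item["type"]].append(item)
    return dict(grouped)


grouped = group_by_type(elements)

# Summarize how many elements of each type were found.
summary = {content_type: len(group) for content_type, group in grouped.items()}
print(json.dumps(summary, indent=2))
```

Grouping by type like this is one way a downstream retrieval step could treat tables and charts differently from plain text.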

Quick Start & Requirements

  • Install: Use Conda:
      conda create -y --name nvingest python=3.10 && \
      conda activate nvingest && \
      conda install -y -c rapidsai -c conda-forge -c nvidia nv_ingest=25.3.0 nv_ingest_client=25.3.0 nv_ingest_api=25.3.0 && \
      pip install opencv-python llama-index-embeddings-nvidia 'pymilvus==2.5.4' 'pymilvus[bulk_writer, model]' milvus-lite nvidia-riva-client unstructured-client
  • Prerequisites: Linux (Ubuntu 22.04+ recommended), Conda, Python 3.10, and the NVIDIA_BUILD_API_KEY and NVIDIA_API_KEY keys.
  • Setup: The quick start demonstrates library mode, which is intended for small workloads (fewer than 100 PDFs). For production deployments, Docker Compose or Kubernetes is recommended.
  • Docs: Official Documentation
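
Before running the quick start, it can help to verify the prerequisites listed above. The following is a minimal sketch that assumes the two API keys are supplied as environment variables (an assumption; this page only names them). The helper `missing_prerequisites` is a hypothetical name, not part of NV-Ingest.

```python
import os
import sys

# Assumption: the keys named in the prerequisites are read from the environment.
REQUIRED_ENV_VARS = ("NVIDIA_BUILD_API_KEY", "NVIDIA_API_KEY")


def missing_prerequisites(env=None, version=None):
    """Return a list of human-readable problems; empty if all checks pass."""
    env = os.environ if env is None else env
    version = sys.version_info[:2] if version is None else version
    problems = []
    # The quick start pins Python 3.10.
    if tuple(version) != (3, 10):
        problems.append(f"Python 3.10 expected, found {version[0]}.{version[1]}")
    # Report every key that is unset or empty.
    for name in REQUIRED_ENV_VARS:
        if not env.get(name):
            problems.append(f"{name} is not set")
    return problems


if __name__ == "__main__":
    for problem in missing_prerequisites():
        print("WARNING:", problem)
```

The function takes the environment and version as optional arguments so it can be exercised without changing the real process environment.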

Highlighted Details

  • Supports PDF, DOCX, PPTX, JPEG, PNG, SVG, TIFF, and TXT file types.
  • Offers multiple extraction methods for PDFs (pdfium, nemoretriever-parse, Unstructured.io, Adobe).
  • Integrates with Milvus for vector storage and LlamaIndex/LangChain for retrieval pipelines.
  • Includes example Python code for ingestion, embedding, Milvus upload, and querying.
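
As a rough illustration of working with the supported file types listed above, the sketch below separates a batch of paths into supported and skipped files before submission. The extension set mirrors the list on this page exactly; whether alternate spellings such as ".jpg" or ".tif" are accepted is not stated here, so they are deliberately left out. The helper `split_supported` is a hypothetical name, not an NV-Ingest function.

```python
from pathlib import Path

# Extensions taken from the file types listed on this page.
# Alternate spellings (.jpg, .tif) are omitted because the page does not mention them.
SUPPORTED_EXTENSIONS = {".pdf", ".docx", ".pptx", ".jpeg", ".png", ".svg", ".tiff", ".txt"}


def split_supported(paths):
    """Split paths into (supported, skipped) by case-insensitive file extension."""
    supported, skipped = [], []
    for path in paths:
        if Path(path).suffix.lower() in SUPPORTED_EXTENSIONS:
            supported.append(path)
        else:
            skipped.append(path)
    return supported, skipped


supported, skipped = split_supported(["report.pdf", "slide.PPTX", "photo.jpg", "notes.md"])
print("supported:", supported)
print("skipped:", skipped)
```

Filtering up front keeps unsupported files out of the ingestion job instead of discovering them as failures later.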

Maintenance & Community

  • Developed by NVIDIA.
  • Contribution requires signing off commits via Developer Certificate of Origin (DCO).

Licensing & Compatibility

  • The project is primarily under a permissive license. Third-party components (e.g., Adobe SDK, Llama tokenizer) may carry separate license terms that require review and may need access tokens.

Limitations & Caveats

  • This is an "early access" set of microservices.
  • GPU-accelerated indexing is not yet available in Milvus Lite.
  • Use of Adobe extraction requires enabling INSTALL_ADOBE_SDK and reviewing its license.
  • Use of Llama tokenizer requires setting HF_ACCESS_TOKEN and requesting access to gated models.

Health Check

  • Last commit: 1 day ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 63
  • Issues (30d): 3
  • Star History: 82 stars in the last 90 days
