Microservice SDK for parsing unstructured documents into retrieval system inputs
NVIDIA Ingest (NV-Ingest) is a microservice-based SDK that extracts text, metadata, and structured content from a wide range of enterprise documents, including PDFs, Office documents, images, and audio. It is designed for large-scale, performance-oriented data ingestion into retrieval systems and targets developers and researchers building generative AI applications.
How It Works
NV-Ingest uses NVIDIA NIMs (self-hosted microservices) for specialized extraction tasks. It supports parallel document splitting, content classification (text, tables, charts, images), and OCR, and it emits results in a structured JSON schema. The pipeline can optionally compute embeddings and store them in Milvus. Multiple extraction methods are available per document type, so users can trade throughput against accuracy.
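The classification and structured-output steps described above can be pictured with a small sketch. Everything in it is hypothetical: the `ExtractedElement` type, its field names, and the `to_records` helper are illustrative only and do not reflect NV-Ingest's actual schema or API.

```python
import json
from collections import Counter
from dataclasses import asdict, dataclass

# Content categories named in the description above.
CONTENT_TYPES = {"text", "table", "chart", "image"}


@dataclass
class ExtractedElement:
    """Illustrative record for one extracted piece of a document (not the real schema)."""
    source: str        # document the element came from
    page: int          # page number within that document
    content_type: str  # one of CONTENT_TYPES
    content: str       # extracted text or a textual rendering of the element

    def __post_init__(self):
        # Reject categories outside the hypothetical set.
        if self.content_type not in CONTENT_TYPES:
            raise ValueError(f"unknown content type: {self.content_type!r}")


def to_records(elements):
    """Serialize elements to a JSON string and count them by content type."""
    payload = json.dumps([asdict(e) for e in elements], indent=2)
    counts = dict(Counter(e.content_type for e in elements))
    return payload, counts


elements = [
    ExtractedElement("report.pdf", 1, "text", "Quarterly summary"),
    ExtractedElement("report.pdf", 2, "table", "| region | revenue |"),
    ExtractedElement("report.pdf", 2, "chart", "Revenue by region"),
]
payload, counts = to_records(elements)
print(counts)
```

A flat list of typed records like this is one plausible shape for downstream embedding and indexing, since each record can be embedded and stored independently.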
Quick Start & Requirements
conda create -y --name nvingest python=3.10 && \
  conda activate nvingest && \
  conda install -y -c rapidsai -c conda-forge -c nvidia nv_ingest=25.3.0 nv_ingest_client=25.3.0 nv_ingest_api=25.3.0 && \
  pip install opencv-python llama-index-embeddings-nvidia 'pymilvus==2.5.4' 'pymilvus[bulk_writer, model]' milvus-lite nvidia-riva-client unstructured-client
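After running the install command, a quick sanity check can confirm that the pinned packages are visible to the active Python environment. The sketch below uses only the standard library. It assumes the conda package names (`nv_ingest`, `nv_ingest_client`, `nv_ingest_api`) match the installed distribution names, which may not hold in every environment; the helper names `installed_versions` and `mismatches` are illustrative.

```python
from importlib.metadata import PackageNotFoundError, version

# Names and pins copied from the install command above. Treating the conda
# package names as Python distribution names is an assumption.
EXPECTED = {
    "nv_ingest": "25.3.0",
    "nv_ingest_client": "25.3.0",
    "nv_ingest_api": "25.3.0",
}


def installed_versions(names):
    """Return {name: installed version, or None if the package is not found}."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


def mismatches(expected, found):
    """List packages that are missing or whose version differs from the pin."""
    return sorted(name for name, pin in expected.items() if found.get(name) != pin)


found = installed_versions(EXPECTED)
for name in mismatches(EXPECTED, found):
    print(f"{name}: expected {EXPECTED[name]}, found {found[name]}")
```

If the loop prints nothing, every pinned package was found at the expected version.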
… NVIDIA_BUILD_API_KEY, and NVIDIA_API_KEY.
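A minimal sketch for checking that the API key variables named here are set before starting a pipeline. Whether both are needed in every deployment is an assumption, and `missing_keys` is an illustrative helper, not part of NV-Ingest.

```python
import os

# Variable names taken from the text above; treating both as required
# is an assumption.
REQUIRED_KEYS = ("NVIDIA_BUILD_API_KEY", "NVIDIA_API_KEY")


def missing_keys(env=None, required=REQUIRED_KEYS):
    """Return the required variables that are unset or blank in env."""
    env = os.environ if env is None else env
    return [key for key in required if not env.get(key, "").strip()]


missing = missing_keys()
if missing:
    print("Missing environment variables: " + ", ".join(missing))
else:
    print("All required API keys are set.")
```

Passing a plain dictionary as `env` makes the check easy to exercise without touching the real environment.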
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
… INSTALL_ADOBE_SDK and reviewing its license.
… HF_ACCESS_TOKEN and requesting access to gated models.