rag by NVIDIA-AI-Blueprints

GPU-accelerated RAG pipeline for enterprise data

Created 9 months ago
259 stars

Top 97.9% on SourcePulse

View on GitHub
1 Expert Loves This Project
Project Summary

NVIDIA-AI-Blueprints/rag provides a reference solution for building foundational Retrieval Augmented Generation (RAG) pipelines. It targets developers seeking a quick, production-ready RAG setup leveraging NVIDIA NIM microservices and GPU acceleration. The blueprint enables querying enterprise data, offering benefits like enhanced data governance, reduced latency, and multimodal data processing.

How It Works

This blueprint implements a modular RAG architecture orchestrated by a LangChain-based server. It uses NVIDIA NIM microservices for the core functions: response generation (LLM inference), embedding generation, and document parsing/extraction. Data is stored in a Milvus vector database accelerated with NVIDIA cuVS. The workflow processes a query, retrieves relevant document chunks, optionally reranks them for precision, and generates a response grounded in the retrieved context. Key advantages include GPU-accelerated indexing and search, multimodal data ingestion, and optional integration of vision language models (VLMs) and guardrails.
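As an illustrative sketch only (not the blueprint's actual code), the retrieve-then-generate loop can be modeled in plain Python; the `embed`, `retrieve`, and `generate` functions below are hypothetical stand-ins for calls to the embedding NIM, the Milvus/cuVS search, and the LLM NIM:

```python
import math

def embed(text: str) -> list[float]:
    # Toy character-frequency embedding; a real pipeline calls the
    # embedding NIM instead.
    vec = [0.0] * 16
    for ch in text.lower():
        vec[ord(ch) % 16] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    # In the blueprint, Milvus with cuVS performs this similarity search
    # on the GPU; here it is a brute-force scan for clarity.
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

def generate(query: str, context: list[str]) -> str:
    # Stand-in for the LLM NIM: show the grounded prompt it would receive.
    return f"Answer to {query!r} using context: {context}"

chunks = ["GPUs accelerate vector search.",
          "Milvus stores embeddings.",
          "The sky is blue."]
print(generate("vector search", retrieve("vector search", chunks)))
```

The optional reranking step described above would sit between `retrieve` and `generate`, re-scoring the top-k chunks with a stronger model before they reach the prompt.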

Quick Start & Requirements

Deployment options include Docker Compose for single-node setups, integration with NVIDIA AI Workbench, or Helm charts for scalable deployments. Sample Jupyter notebooks are provided for interaction. Prerequisites include NVIDIA NIM microservices and compatible NVIDIA GPUs. The default pipeline uses models such as nvidia/llama-3.3-nemotron-super-49b-v1 for generation and nvidia/llama-3_2-nv-embedqa-1b-v2 for embeddings. Official quick-start guides, API specifications, and usage notebooks are linked from the documentation.
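Because the server exposes OpenAI-compatible APIs, a deployed pipeline can be queried with any HTTP client using the standard chat-completions request shape. The sketch below builds such a request with only the standard library; the base URL and port are assumptions for illustration, so consult the blueprint's API spec for the real endpoint:

```python
import json

def build_chat_request(question: str,
                       base_url: str = "http://localhost:8081/v1",
                       model: str = "nvidia/llama-3.3-nemotron-super-49b-v1"):
    # Standard OpenAI-style chat-completions payload; the host/port here
    # are placeholders, not the blueprint's documented defaults.
    url = f"{base_url}/chat/completions"
    headers = {"Content-Type": "application/json"}
    body = {
        "model": model,
        "messages": [{"role": "user", "content": question}],
        "temperature": 0.2,
    }
    return url, headers, json.dumps(body)

url, headers, payload = build_chat_request("What does our Q3 report say?")
# Send with any HTTP client, e.g. urllib.request or requests.
```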

Highlighted Details

  • Supports multimodal PDF data extraction, including text, tables, charts, and infographics.
  • Features hybrid search combining dense and sparse retrieval methods.
  • Offers optional VLM integration for answer generation and image captioning.
  • Provides GPU-accelerated index creation and search capabilities.
  • Includes optional NeMo Guardrails for input/output content safety and topic control.
  • Exposes OpenAI-compatible APIs and a sample UI (rag-playground).
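To make the hybrid-search bullet concrete, the toy sketch below fuses a dense (embedding-similarity) score with a sparse (term-overlap) score via a weighted sum. This is one common fusion scheme; the blueprint's actual scoring functions and fusion method may differ:

```python
import math
from collections import Counter

docs = ["gpu accelerated vector search",
        "sparse keyword retrieval with bm25",
        "dense embeddings capture semantics"]

def sparse_score(query: str, doc: str) -> float:
    # Simple term-overlap stand-in for BM25-style sparse scoring.
    q, d = Counter(query.split()), Counter(doc.split())
    return float(sum(min(q[t], d[t]) for t in q))

def dense_score(query: str, doc: str) -> float:
    # Character-frequency cosine as a stand-in for an embedding model.
    def vec(text: str) -> list[float]:
        v = [0.0] * 26
        for ch in text:
            if ch.isalpha():
                v[(ord(ch.lower()) - ord("a")) % 26] += 1.0
        n = math.sqrt(sum(x * x for x in v)) or 1.0
        return [x / n for x in v]
    return sum(a * b for a, b in zip(vec(query), vec(doc)))

def hybrid_rank(query: str, alpha: float = 0.5) -> list[str]:
    # Weighted-sum fusion: alpha blends dense vs. sparse contributions.
    scored = [(alpha * dense_score(query, d) +
               (1 - alpha) * sparse_score(query, d), d) for d in docs]
    return [d for _, d in sorted(scored, reverse=True)]

print(hybrid_rank("sparse retrieval"))
```

Tuning `alpha` trades recall of exact keyword matches (sparse) against semantic matches (dense), which is why hybrid search tends to outperform either method alone.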

Maintenance & Community

The project is hosted on GitHub, encouraging community contributions through issues and pull requests to support the NVIDIA LLM ecosystem and gather feedback.

Licensing & Compatibility

The blueprint itself is licensed under the Apache License, Version 2.0. Use of the integrated models is governed by specific NVIDIA licenses, including the NVIDIA AI Foundation Models Community License and the Llama 3.2 Community License Agreement for certain models. Compatibility requires NVIDIA hardware and software stack.

Limitations & Caveats

Advanced features such as self-reflection, query rewriting, image captioning, NeMo Guardrails, VLM inferencing, and PDF extraction with Nemoretriever Parse are not supported on B200 GPUs; H100 or A100 GPUs are recommended for these functionalities. Image captioning is disabled by default to optimize latency, potentially affecting accuracy for image-related queries.

Health Check
Last Commit

1 day ago

Responsiveness

Inactive

Pull Requests (30d)
1
Issues (30d)
4
Star History
39 stars in the last 30 days

Explore Similar Projects

Starred by Tri Dao (Chief Scientist at Together AI), Stas Bekman (Author of "Machine Learning Engineering Open Book"; Research Engineer at Snowflake), and 1 more.

oslo by tunib-ai

0%
309
Framework for large-scale transformer optimization
Created 3 years ago
Updated 3 years ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Jeff Hammerbacher (Cofounder of Cloudera), and 2 more.

towhee by towhee-io

0.0%
3k
Framework for neural data processing pipelines
Created 4 years ago
Updated 11 months ago