StageRAG by darrencxl0301

Hallucination-resistant RAG framework

Created 1 month ago
473 stars

Top 64.5% on SourcePulse

View on GitHub
Summary

StageRAG is a production-ready framework for building hallucination-resistant Retrieval-Augmented Generation (RAG) systems. It gives precise control over the speed-versus-accuracy trade-off and manages LLM response uncertainty, enabling high-factuality applications. The target audience is engineers and researchers building robust RAG solutions.

How It Works

The framework employs dual-mode pipelines: a 3-step "Speed" mode using 1B and 3B models for rapid responses (~3-5 s), and a 4-step "Precision" mode using a 3B model for deeper analysis (~6-12 s). It ingests user knowledge bases from JSONL files and automatically builds vector indices over them. A multi-component confidence scoring system evaluates retrieval quality, answer structure, relevance, and uncertainty, so low-confidence outputs can be handled programmatically to mitigate hallucinations. The system is optimized for the smaller Llama 3.2 models with 4-bit quantization, requiring minimal GPU memory.
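
The index-building step is not documented here, but given the stated FAISS and Sentence Transformers dependencies it presumably follows the standard embed-then-index pattern. The sketch below is a generic illustration under that assumption; the embedding model name and index type are placeholders, not StageRAG's actual choices.

    # Generic FAISS + Sentence Transformers indexing pattern (illustrative;
    # StageRAG's own builder may use a different model and index type).
    import faiss
    from sentence_transformers import SentenceTransformer

    encoder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder model
    texts = [
        "Retrieval Augmented Generation grounds answers in retrieved documents.",
        "4-bit quantization shrinks the memory footprint of LLMs.",
    ]

    # Normalized embeddings + an inner-product index give cosine similarity.
    embeddings = encoder.encode(texts, normalize_embeddings=True)
    index = faiss.IndexFlatIP(embeddings.shape[1])
    index.add(embeddings)

    # Retrieve the closest document for a query.
    query = encoder.encode(["What grounds RAG answers?"], normalize_embeddings=True)
    scores, ids = index.search(query, 1)
    print(texts[ids[0][0]], float(scores[0][0]))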

Quick Start & Requirements

  • Prerequisites: Access to Meta's Llama 3.2 1B and 3B models on HuggingFace (requires accepting the gated model licenses) and a HuggingFace CLI login (pip install huggingface-hub, then huggingface-cli login).
  • System Requirements: Python >= 3.8, CUDA-capable GPU (recommended) or CPU, 5GB+ RAM (4-bit mode) / 10GB+ RAM (full precision), internet for model downloads.
  • Installation: Clone the repository, cd StageRAG, pip install -r requirements.txt, pip install -e ., python setup.py.
  • Dataset: Download sample data with python scripts/download_data.py, or provide a custom JSONL file (a sketch of the expected shape follows this list).
  • Usage: Run the interactive demo with python demo/interactive_demo.py --rag_dataset data/data.jsonl [--use_4bit --device cuda]. Programmatic use via from stagerag import StageRAGSystem.
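
The exact JSONL schema StageRAG expects is not shown here. The sketch below writes a minimal knowledge base; the field names ("id", "title", "text") are assumptions, so check the sample data from scripts/download_data.py for the real schema.

    # Hypothetical knowledge-base file; the field names are assumptions.
    import json

    docs = [
        {"id": "doc-001", "title": "RAG basics",
         "text": "Retrieval Augmented Generation grounds answers in documents."},
        {"id": "doc-002", "title": "Quantization",
         "text": "4-bit quantization reduces GPU memory requirements."},
    ]

    with open("data/data.jsonl", "w", encoding="utf-8") as f:
        for doc in docs:
            f.write(json.dumps(doc, ensure_ascii=False) + "\n")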

Highlighted Details

  • Dual-Mode Pipelines: Dynamically switch between 3-step (Speed) and 4-step (Precision) processing for tailored performance.
  • Hallucination Mitigation: Integrated confidence scoring and uncertainty detection for robust answer validation (see the gating sketch after this list).
  • Resource Efficiency: Optimized for Llama 3.2 1B/3B models with 4-bit quantization, requiring only 5-10GB GPU memory.
  • Performance: Benchmarked at ~3.3s (Speed) and ~7.8s (Precision) average response times on an NVIDIA RTX 3090 with 4-bit quantization.
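
The confidence score is the hook for hallucination mitigation: low-scoring answers can be routed to a fallback instead of being shown to users. Below is a minimal sketch of that gating; apart from the documented import, the constructor arguments, query method, mode parameter, and confidence field are illustrative assumptions rather than the project's confirmed API.

    # Illustrative confidence gating; everything except the import is an
    # assumed API, so consult the repository for the actual method names.
    from stagerag import StageRAGSystem

    rag = StageRAGSystem(rag_dataset="data/data.jsonl",  # hypothetical args
                         use_4bit=True, device="cuda")

    result = rag.query("Who maintains StageRAG?", mode="precision")  # or "speed"

    THRESHOLD = 0.6  # illustrative cut-off, not a documented default
    if result.confidence < THRESHOLD:
        # Handle the low-confidence output programmatically, per the summary.
        print("Low confidence; deferring instead of risking a hallucination.")
    else:
        print(result.answer)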

Maintenance & Community

Contributions are welcome via standard GitHub pull requests. The primary contact is Darren Chai Xin Lun (@darrencxl0301 on both GitHub and HuggingFace). Key dependencies include the Llama 3.2 models, FAISS, and Sentence Transformers.

Licensing & Compatibility

This project is released under the MIT License, which permits commercial use and integration into closed-source applications.

Limitations & Caveats

Access approval for the gated Llama 3.2 models on HuggingFace is a mandatory prerequisite. While CPU inference is supported, a CUDA-capable GPU is recommended for optimal performance.
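
Both caveats surface at model-load time. A generic sketch of loading a gated Llama 3.2 checkpoint with 4-bit quantization on GPU and a full-precision CPU fallback (standard transformers/bitsandbytes usage, not StageRAG's internal code):

    # The checkpoint is gated: accept the license on HuggingFace and run
    # `huggingface-cli login` before this will download.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    model_id = "meta-llama/Llama-3.2-3B-Instruct"
    tokenizer = AutoTokenizer.from_pretrained(model_id)

    if torch.cuda.is_available():
        # 4-bit quantization via bitsandbytes is a GPU-oriented path.
        bnb = BitsAndBytesConfig(load_in_4bit=True,
                                 bnb_4bit_compute_dtype=torch.float16)
        model = AutoModelForCausalLM.from_pretrained(
            model_id, quantization_config=bnb, device_map="auto")
    else:
        # Slower but supported: full-precision load on CPU.
        model = AutoModelForCausalLM.from_pretrained(model_id)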

Health Check

  • Last Commit: 3 weeks ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 425 stars in the last 30 days

Explore Similar Projects

AutoRAG by Marker-Inc-Korea
RAG AutoML tool for optimizing RAG pipelines
Top 0.2% on SourcePulse · 4k stars · Created 1 year ago · Updated 3 weeks ago
Starred by Li Jiang (Coauthor of AutoGen; Engineer at Microsoft), Elie Bursztein (Cybersecurity Lead at Google DeepMind), and 1 more.