EpsteinFiles-RAG by AnkitNayak-eth

RAG pipeline for grounded question answering

Created 2 weeks ago

330 stars

Top 83.2% on SourcePulse

Project Summary

This project implements a Retrieval Augmented Generation (RAG) pipeline specifically for the 'Epstein Files 20K' dataset. It addresses the need for accurate, grounded answers derived solely from source documents, mitigating hallucinations common in LLM applications. The pipeline is designed for researchers, engineers, and power users seeking to query and understand large, complex datasets with high fidelity and speed. Its key benefit is providing verifiable, contextually rich answers with minimal setup.

How It Works

The RAG pipeline operates in three stages. Stage one, Data Preparation, cleans the fragmented source documents, applies semantic chunking, and embeds the chunks into a vector database. Stage two, Intelligent Retrieval, uses Maximal Marginal Relevance (MMR) to select diverse yet relevant document chunks, overcoming the redundancy of pure similarity search. Stage three, Grounded Answer Generation, feeds the retrieved context and the user query to a LLaMA 3.3 model, which produces an answer strictly based on the provided sources. The MMR step ensures comprehensive context coverage, leading to more robust and informative responses.
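The MMR re-ranking described above can be sketched as follows. This is a minimal illustration, not the repository's implementation: the embeddings and the `lam` trade-off weight are invented for the example, and the real pipeline retrieves from a Chroma vector store rather than an in-memory list.

```python
import numpy as np

def mmr(query_vec, doc_vecs, k=2, lam=0.5):
    """Greedily select k documents, balancing relevance to the query
    against redundancy with documents already selected."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    selected, candidates = [], list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        scores = []
        for i in candidates:
            relevance = cos(query_vec, doc_vecs[i])
            # Penalize similarity to anything already picked.
            redundancy = max((cos(doc_vecs[i], doc_vecs[j]) for j in selected),
                             default=0.0)
            scores.append(lam * relevance - (1 - lam) * redundancy)
        best = candidates[int(np.argmax(scores))]
        selected.append(best)
        candidates.remove(best)
    return selected
```

With `lam=0.5`, a near-duplicate of an already-selected chunk scores poorly even if it is highly similar to the query, which is exactly how MMR avoids the redundancy of a pure similarity search.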

Quick Start & Requirements

  • Primary Install: pip install -r requirements.txt after cloning the repository.
  • Prerequisites: Python 3.11+, 16GB RAM (8GB minimum), Groq API key.
  • Setup Time: Approximately 5 minutes for core setup, excluding the initial data ingestion pipeline (which can take 30-75 minutes total).
  • Links:
    • Dataset: https://huggingface.co/datasets/teyler/epstein-files-20k
    • Precomputed Embeddings: https://huggingface.co/datasets/devankit7873/EpsteinFiles-Vector-Embeddings-ChromaDB
    • Groq API: https://console.groq.com
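The steps above amount to roughly the following shell session. This is a sketch: the repository URL placeholder must be filled in, and the `GROQ_API_KEY` variable name is an assumption based on common Groq client conventions.

```shell
# Core setup (~5 minutes); data ingestion afterwards can take 30-75 minutes.
git clone <repo-url>          # replace with the repository's GitHub URL
cd EpsteinFiles-RAG
python -m venv .venv && source .venv/bin/activate   # requires Python 3.11+
pip install -r requirements.txt
export GROQ_API_KEY="gsk_..."  # key from https://console.groq.com
```

Downloading the precomputed ChromaDB embeddings linked above avoids re-running the embedding stage locally.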

Highlighted Details

  • No Hallucinations: Answers are strictly grounded in retrieved source documents.
  • Intelligent Retrieval: Employs MMR for diverse and relevant context selection.
  • Fast Processing: Achieves ~1 second end-to-end query response time.
  • Semantic Understanding: Utilizes context-aware document chunking.
  • Production-Ready: Features a REST API (FastAPI), interactive Streamlit UI, async support, error handling, and logging.
  • Scalability: Designed to handle 100K+ document chunks.

Maintenance & Community

The project is built by Ankit Kumar Nayak. Support and inquiries can be directed via GitHub Issues or Discussions. No specific community channels like Discord or Slack are mentioned.

Licensing & Compatibility

Licensed under the MIT License, permitting broad use, modification, and distribution, including for commercial purposes.

Limitations & Caveats

The initial data ingestion process is time-consuming: embedding generation alone can take up to 45 minutes, and the full pipeline 30-75 minutes. A Groq API key is mandatory for LLM inference. The project is explicitly stated to be for research, transparency, and educational purposes.

Health Check

  • Last Commit: 1 week ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 2
  • Star History: 334 stars in the last 15 days
