EpsteinFiles-RAG by AnkitNayak-eth

RAG pipeline for grounded question answering

Created 2 weeks ago

330 stars

Top 83.2% on SourcePulse

Project Summary

This project implements a Retrieval Augmented Generation (RAG) pipeline specifically for the 'Epstein Files 20K' dataset. It addresses the need for accurate, grounded answers derived solely from source documents, mitigating hallucinations common in LLM applications. The pipeline is designed for researchers, engineers, and power users seeking to query and understand large, complex datasets with high fidelity and speed. Its key benefit is providing verifiable, contextually rich answers with minimal setup.

How It Works

The RAG pipeline operates in three stages. Stage one, Data Preparation, cleans the fragmented source documents, applies semantic chunking, and embeds the chunks into a vector database. Stage two, Intelligent Retrieval, uses Maximal Marginal Relevance (MMR) to select diverse yet relevant document chunks, overcoming the redundancy of pure similarity search. Stage three, Grounded Answer Generation, feeds the retrieved context and the user query to a LLaMA 3.3 model, which produces an answer strictly based on the provided sources. The MMR step ensures comprehensive context coverage, leading to more robust and informative responses.
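The MMR re-ranking described above can be sketched as follows. This is a minimal illustration, not the repository's implementation: the embeddings and the `lam` trade-off weight are invented for the example, and the real pipeline retrieves from a Chroma vector store rather than an in-memory list.

```python
import numpy as np

def mmr(query_vec, doc_vecs, k=2, lam=0.5):
    """Greedily select k documents, balancing relevance to the query
    against redundancy with documents already selected."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    selected, candidates = [], list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        scores = []
        for i in candidates:
            relevance = cos(query_vec, doc_vecs[i])
            # Penalize similarity to anything already picked.
            redundancy = max((cos(doc_vecs[i], doc_vecs[j]) for j in selected),
                             default=0.0)
            scores.append(lam * relevance - (1 - lam) * redundancy)
        best = candidates[int(np.argmax(scores))]
        selected.append(best)
        candidates.remove(best)
    return selected
```

With `lam=0.5`, a near-duplicate of an already-selected chunk scores poorly even if it is highly similar to the query, which is exactly how MMR avoids the redundancy of a pure similarity search.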

Quick Start & Requirements

  • Primary Install: pip install -r requirements.txt after cloning the repository.
  • Prerequisites: Python 3.11+, 16GB RAM (8GB minimum), Groq API key.
  • Setup Time: Approximately 5 minutes for core setup, excluding the initial data ingestion pipeline (which can take 30-75 minutes total).
  • Links:
    • Dataset: https://huggingface.co/datasets/teyler/epstein-files-20k
    • Precomputed Embeddings: https://huggingface.co/datasets/devankit7873/EpsteinFiles-Vector-Embeddings-ChromaDB
    • Groq API: https://console.groq.com
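The steps above amount to roughly the following shell session. This is a sketch: the repository URL placeholder must be filled in, and the `GROQ_API_KEY` variable name is an assumption based on common Groq client conventions.

```shell
# Core setup (~5 minutes); data ingestion afterwards can take 30-75 minutes.
git clone <repo-url>          # replace with the repository's GitHub URL
cd EpsteinFiles-RAG
python -m venv .venv && source .venv/bin/activate   # requires Python 3.11+
pip install -r requirements.txt
export GROQ_API_KEY="gsk_..."  # key from https://console.groq.com
```

Downloading the precomputed ChromaDB embeddings linked above avoids re-running the embedding stage locally.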

Highlighted Details

  • No Hallucinations: Answers are strictly grounded in retrieved source documents.
  • Intelligent Retrieval: Employs MMR for diverse and relevant context selection.
  • Fast Processing: Achieves ~1 second end-to-end query response time.
  • Semantic Understanding: Utilizes context-aware document chunking.
  • Production-Ready: Features a REST API (FastAPI), interactive Streamlit UI, async support, error handling, and logging.
  • Scalability: Designed to handle 100K+ document chunks.

Maintenance & Community

The project is built by Ankit Kumar Nayak. Support and inquiries can be directed via GitHub Issues or Discussions. No specific community channels like Discord or Slack are mentioned.

Licensing & Compatibility

Licensed under the MIT License, permitting broad use, modification, and distribution, including for commercial purposes.

Limitations & Caveats

The initial data ingestion process is time-consuming: embedding generation alone can take up to 45 minutes, and the full pipeline 30-75 minutes. A Groq API key is mandatory for LLM inference. The project is explicitly stated to be for research, transparency, and educational purposes.

Health Check

  • Last Commit: 1 week ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 2
  • Star History: 334 stars in the last 15 days
