AnkitNayak-eth: RAG pipeline for grounded question answering
This project implements a Retrieval Augmented Generation (RAG) pipeline specifically for the 'Epstein Files 20K' dataset. It addresses the need for accurate, grounded answers derived solely from source documents, mitigating hallucinations common in LLM applications. The pipeline is designed for researchers, engineers, and power users seeking to query and understand large, complex datasets with high fidelity and speed. Its key benefit is providing verifiable, contextually rich answers with minimal setup.
How It Works
The RAG pipeline operates in three stages. Stage one, Data Preparation, cleans fragmented documents, applies semantic chunking, and embeds the chunks into a vector database. Stage two, Intelligent Retrieval, uses Maximal Marginal Relevance (MMR) to select document chunks that are relevant to the query yet diverse from one another, overcoming the redundancy of pure similarity search. Stage three, Grounded Answer Generation, feeds the retrieved context and the user query to a LLaMA 3.3 LLM, which produces an answer based strictly on the provided sources. The MMR step broadens the retrieved context, leading to more robust and informative responses.
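The MMR selection in stage two can be sketched in pure Python. This is an illustrative implementation of the standard MMR formula, not the project's actual retrieval code: each pick maximizes `lambda * relevance - (1 - lambda) * redundancy`, where redundancy is the highest similarity to any already-selected chunk.

```python
import numpy as np

def mmr(query_vec, doc_vecs, k=3, lambda_param=0.5):
    """Select k diverse-but-relevant documents via Maximal Marginal Relevance.

    Returns the indices of the chosen vectors, in selection order.
    """
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    candidates = list(range(len(doc_vecs)))
    selected = []
    while candidates and len(selected) < k:
        best, best_score = None, -np.inf
        for i in candidates:
            relevance = cos(query_vec, doc_vecs[i])
            # Penalize similarity to anything already selected.
            redundancy = max((cos(doc_vecs[i], doc_vecs[j]) for j in selected),
                             default=0.0)
            score = lambda_param * relevance - (1 - lambda_param) * redundancy
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
        candidates.remove(best)
    return selected
```

With `lambda_param=0.5`, a near-duplicate of the top hit scores poorly on the second pick, so a less similar but still relevant chunk is chosen instead, which is exactly the behavior the pipeline relies on to avoid redundant context.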
Quick Start & Requirements
Clone the repository, then run `pip install -r requirements.txt`.

Dataset: https://huggingface.co/datasets/teyler/epstein-files-20k
Prebuilt vector embeddings (ChromaDB): https://huggingface.co/datasets/devankit7873/EpsteinFiles-Vector-Embeddings-ChromaDB
Groq API key: https://console.groq.com

Highlighted Details
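Once retrieval is set up, grounding comes down to how the prompt is assembled before the Groq/LLaMA call. The sketch below is a hypothetical prompt builder (the function name and template are assumptions, not the project's actual code) showing the general pattern: number the retrieved chunks as sources and instruct the model to answer only from them.

```python
def build_grounded_prompt(question, chunks):
    """Assemble a prompt that restricts the LLM to the retrieved sources.

    `chunks` is a list of retrieved text passages; each is labeled so the
    model can cite it. This template is illustrative only.
    """
    context = "\n\n".join(f"[Source {i + 1}]\n{chunk}"
                          for i, chunk in enumerate(chunks))
    return (
        "Answer the question using ONLY the sources below. "
        "If the sources do not contain the answer, say you cannot answer.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )
```

The resulting string would then be sent as the user message in a chat-completion request to the Groq API, which is why the API key is a hard requirement.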
Maintenance & Community
The project is built by Ankit Kumar Nayak. Support and inquiries can be directed via GitHub Issues or Discussions. No specific community channels like Discord or Slack are mentioned.
Licensing & Compatibility
Licensed under the MIT License, permitting broad use, modification, and distribution, including for commercial purposes.
Limitations & Caveats
The initial data ingestion process, particularly embedding generation, can be time-consuming (up to 45 minutes). A Groq API key is a mandatory requirement for LLM inference. The project is explicitly stated to be for research, transparency, and educational purposes.