rag-from-scratch by langchain-ai

RAG tutorial for expanding LLM knowledge via external data

Created 1 year ago
5,497 stars

Top 9.2% on SourcePulse

1 Expert Loves This Project
Project Summary

This repository provides a hands-on, step-by-step guide to building Retrieval Augmented Generation (RAG) systems from scratch, aimed at developers and researchers seeking to enhance LLMs with external, up-to-date, or private data. It offers a foundational understanding of RAG's core components: indexing, retrieval, and generation, enabling LLMs to access and utilize information beyond their training data.

How It Works

The project breaks down RAG into its fundamental stages, demonstrating how to ingest documents, create searchable indexes (likely using vector embeddings), retrieve relevant information based on user queries, and then integrate this retrieved context into prompts for an LLM to generate informed responses. This approach allows LLMs to ground their outputs in specific, external knowledge, improving factual accuracy and relevance without costly fine-tuning.
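
The stages described above can be sketched in a few lines of standard-library Python. This is an illustrative sketch, not code from the repository: the function names (`embed`, `build_index`, `retrieve`, `build_prompt`) and the sample documents are hypothetical, and the bag-of-words "embedding" is a deliberately crude stand-in for a learned embedding model. The final LLM call is omitted; the sketch stops at the assembled prompt.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: a term-count vector (stand-in for a learned embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(documents):
    """Indexing: store each document alongside its vector."""
    return [(doc, embed(doc)) for doc in documents]


def retrieve(index, query, k=2):
    """Retrieval: return the k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]


def build_prompt(query, context_docs):
    """Generation input: place the retrieved context ahead of the question."""
    context = "\n".join(f"- {d}" for d in context_docs)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


# Hypothetical sample data, for illustration only.
documents = [
    "The warehouse ships orders on Mondays and Thursdays.",
    "Refunds are processed within five business days.",
    "Support is available by email around the clock.",
]
index = build_index(documents)
top = retrieve(index, "When are refunds processed?", k=1)
prompt = build_prompt("When are refunds processed?", top)
print(prompt)
```

In a real system, `embed` would call an embedding model and the index would typically live in a vector store, but the control flow (index once, retrieve per query, prepend context to the prompt) stays the same.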

Quick Start & Requirements

The project consists of Jupyter notebooks. Running them requires Python 3.8+ and standard data science libraries (e.g., numpy, pandas, torch, transformers, langchain); each notebook lists its own specific dependencies. Links to the accompanying video playlist and detailed notebook instructions are available in the repository.

Highlighted Details

  • Focuses on building RAG from fundamental principles, avoiding high-level abstractions initially.
  • Covers the end-to-end pipeline: data loading, chunking, embedding, indexing, retrieval, and generation.
  • Aims to demystify the inner workings of RAG systems.
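
Of the pipeline steps listed above, chunking is the easiest to show in isolation. The following is a minimal sketch assuming fixed-size character windows with overlap; the function name `chunk_text` and its default sizes are hypothetical, and the notebooks may split text differently (for example by tokens or by document structure).

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into fixed-size character chunks that overlap their neighbors.

    The overlap keeps a sentence that straddles a boundary visible in both chunks.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, overlap=2))
```

Each chunk would then be embedded and stored in the index; smaller chunks give more precise matches, while larger ones carry more surrounding context into the prompt.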

Maintenance & Community

This repository is published by langchain-ai. Community support can likely be found through LangChain's official channels, such as its Discord server or GitHub discussions.

Licensing & Compatibility

The repository is licensed under the MIT License, which permits commercial use and modification.

Limitations & Caveats

As a "from scratch" educational resource, the provided code may not be optimized for production-level performance or scalability. Users will need to adapt and integrate components into robust production systems. The specific LLMs and embedding models used will require separate setup and potentially API keys.
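
One common way to handle the API-key setup mentioned above is to read the key from an environment variable and fail early if it is missing. The sketch below assumes that approach; the helper `require_api_key` and the variable name `EXAMPLE_LLM_API_KEY` are hypothetical, and the actual variable name depends on the model provider being used.

```python
import os


def require_api_key(var_name):
    """Return the API key stored in the given environment variable, or fail early."""
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {var_name} environment variable before running the notebooks."
        )
    return key


# Hypothetical variable name; substitute the one your provider expects.
os.environ.setdefault("EXAMPLE_LLM_API_KEY", "placeholder-key")
api_key = require_api_key("EXAMPLE_LLM_API_KEY")
```

Keeping keys in the environment rather than in notebook cells avoids committing them to version control by accident.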

Health Check

  • Last commit: 2 months ago
  • Responsiveness: 1+ week
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 214 stars in the last 30 days
