rag-from-scratch by langchain-ai

RAG tutorial for expanding LLM knowledge via external data

Created 1 year ago

6,187 stars

Top 8.2% on SourcePulse

1 Expert Loves This Project

Elvis Saravia

Founder of DAIR.AI

Project Summary

This repository provides a hands-on, step-by-step guide to building Retrieval-Augmented Generation (RAG) systems from scratch, aimed at developers and researchers who want to enhance LLMs with external, up-to-date, or private data. It builds a foundational understanding of RAG's core components (indexing, retrieval, and generation), which together let an LLM use information beyond its training data.

How It Works

The project breaks down RAG into its fundamental stages, demonstrating how to ingest documents, create searchable indexes (likely using vector embeddings), retrieve relevant information based on user queries, and then integrate this retrieved context into prompts for an LLM to generate informed responses. This approach allows LLMs to ground their outputs in specific, external knowledge, improving factual accuracy and relevance without costly fine-tuning.
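The stages described above can be sketched in a few lines of Python. This is a minimal illustration rather than code from the repository: the function names (`embed`, `build_index`, `retrieve`, `build_prompt`) and the sample documents are hypothetical, and a toy bag-of-words vector stands in for the learned embedding model a real pipeline would use. The final LLM call is left out, since it depends on the model and provider chosen.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model (assumption): word-count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_index(docs):
    """Indexing: store each document alongside its vector."""
    return [(doc, embed(doc)) for doc in docs]


def retrieve(index, query, k=2):
    """Retrieval: return the k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]


def build_prompt(query, context_docs):
    """Generation input: place the retrieved context ahead of the question."""
    context = "\n".join(f"- {d}" for d in context_docs)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )


# Hypothetical sample data, for illustration only.
docs = [
    "The warranty covers battery replacement for two years.",
    "Shipping takes three to five business days.",
    "Returns are accepted within thirty days of purchase.",
]
index = build_index(docs)
question = "How long is the battery warranty?"
top = retrieve(index, question, k=1)
prompt = build_prompt(question, top)
```

The resulting `prompt` string is what would be sent to the LLM. Swapping `embed` for a real embedding model and the list-based index for a vector store changes the quality of retrieval but not the overall shape of the pipeline.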

Quick Start & Requirements

The project consists of Jupyter notebooks. Running these notebooks requires Python 3.8+ and standard data science libraries (e.g., numpy, pandas, torch, transformers, langchain). Specific dependencies will be detailed within the notebooks themselves. Links to the accompanying video playlist and detailed notebook instructions are available in the repository.

Highlighted Details

Focuses on building RAG from fundamental principles, avoiding high-level abstractions initially.
Covers the end-to-end pipeline: data loading, chunking, embedding, indexing, retrieval, and generation.
Aims to demystify the inner workings of RAG systems.
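Of the pipeline steps listed above, chunking is the one most easily shown in isolation. The following is a hedged sketch of fixed-size character chunking with overlap, one common strategy among several; the function name `chunk_text` and its default sizes are assumptions chosen for illustration, not values taken from the notebooks.

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into fixed-size character windows that overlap.

    Overlap keeps a sentence that straddles a boundary visible in both
    neighbouring chunks. Sizes here are illustrative assumptions.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


# Small example: windows of 4 characters sharing 1 character with the next.
example_chunks = chunk_text("abcdefghij", chunk_size=4, overlap=1)
```

Each chunk would then be embedded and indexed separately, so retrieval returns passages of a manageable size rather than whole documents.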

Maintenance & Community

This repository is associated with LangChain AI. Further community engagement and support can likely be found through LangChain's official channels, such as their Discord server or GitHub discussions.

Licensing & Compatibility

The repository is licensed under the MIT License, which permits commercial use and modification.

Limitations & Caveats

As a "from scratch" educational resource, the provided code may not be optimized for production-level performance or scalability. Users will need to adapt and integrate components into robust production systems. The specific LLMs and embedding models used will require separate setup and potentially API keys.

Health Check

Last Commit

6 months ago

Responsiveness

1+ week

Star History

158 stars in the last 30 days