RAG-Retrieval by NovaSearch-Team

End-to-end code for RAG retrieval model training, inference, and distillation

created 1 year ago
985 stars

Top 38.4% on sourcepulse

Project Summary

This project provides an end-to-end solution for training, inference, and distillation of Retrieval-Augmented Generation (RAG) retrieval models, including embedding, ColBERT, and reranker components. It targets researchers and developers working with RAG systems, offering unified code and support for various open-source models, with a focus on efficient fine-tuning and distillation from large to small models.

How It Works

The framework supports fine-tuning of diverse RAG retrieval models: embedding models (BERT-based and LLM-based), late-interaction models (ColBERT), and reranker models (BERT-based and LLM-based). It applies algorithms such as MRL (Matryoshka Representation Learning) loss, which trains nested embedding prefixes so vectors can be truncated for dimensionality reduction, and supports multi-GPU training via DeepSpeed and FSDP. For inference, the lightweight Python library rag-retrieval offers a unified interface to various reranker models, including dedicated logic for handling long documents.
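One common way to handle documents that exceed a reranker's input length is to score overlapping chunks and keep the best chunk score. The sketch below illustrates that general technique in plain Python; the function and parameter names (score_fn, chunk_words, stride_words) are hypothetical, and the rag-retrieval library's actual long-document logic may differ.

```python
def score_long_document(query, document, score_fn,
                        chunk_words=128, stride_words=64):
    """Score a long document by a sliding word-window: score each
    overlapping chunk against the query and return the maximum.

    Illustrative sketch only -- not the rag-retrieval implementation.
    score_fn(query, text) -> float is any short-text relevance scorer.
    """
    words = document.split()
    # Short documents fit in one pass; no chunking needed.
    if len(words) <= chunk_words:
        return score_fn(query, document)
    best = float("-inf")
    for start in range(0, len(words), stride_words):
        chunk = " ".join(words[start:start + chunk_words])
        best = max(best, score_fn(query, chunk))
        # Stop once a window has reached the end of the document.
        if start + chunk_words >= len(words):
            break
    return best
```

Because overlapping strides are used, a relevant passage split across a chunk boundary still lands fully inside some window.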

Quick Start & Requirements

  • Training: conda create -n rag-retrieval python=3.8, activate the environment, then pip install -r requirements.txt. Installing PyTorch manually to match your CUDA version is recommended.
  • Inference (reranker): pip install rag-retrieval. The same manual PyTorch/CUDA installation advice applies.
  • Prerequisites: Python 3.8+ and CUDA, with a PyTorch build matching your CUDA version.
  • Docs: Tutorial

Highlighted Details

  • Supports distillation of LLM-based rerankers to BERT-based models.
  • Achieves competitive performance on MTEB Reranking tasks, with a custom rag-retrieval-reranker model showing strong results.
  • Implements LLM preference-based supervised fine-tuning for RAG retrievers.
  • Includes implementation of MRL loss for embedding models.
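The MRL loss mentioned above trains embeddings so that truncated prefixes remain useful for retrieval. A minimal sketch of the idea, assuming a single positive and negative per query: apply an InfoNCE-style term at several nested prefix lengths and average. The dims and temperature values here are illustrative, not the repo's defaults.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (lists of floats)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u)) or 1.0  # guard against zero norm
    nv = math.sqrt(sum(b * b for b in v)) or 1.0
    return dot / (nu * nv)

def mrl_loss(query, positive, negative, dims=(2, 4, 8), temperature=0.05):
    """Matryoshka-style objective: average a contrastive term over
    nested embedding prefixes so truncated vectors stay discriminative.

    Hypothetical sketch of the technique, not the repo's implementation.
    """
    total = 0.0
    for d in dims:
        s_pos = cosine(query[:d], positive[:d]) / temperature
        s_neg = cosine(query[:d], negative[:d]) / temperature
        m = max(s_pos, s_neg)  # log-sum-exp stabilisation
        log_z = m + math.log(math.exp(s_pos - m) + math.exp(s_neg - m))
        total += log_z - s_pos  # negative log-softmax of the positive
    return total / len(dims)
```

Summing the same contrastive term at every prefix length is what lets a deployed system later truncate embeddings (e.g. from 768 to 256 dimensions) with little quality loss.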

Maintenance & Community

  • Recent updates include core training code for Stella and Jasper embedding models (distillation of SOTA models), LLM-based reranker methods, and MRL loss implementation.
  • Community engagement via WeChat group.
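The LLM-to-small-model reranker distillation noted above is often implemented as score-margin matching. The sketch below shows Margin-MSE, a common reranker distillation objective, as one plausible shape for such a loss; the repo's exact objective may differ.

```python
def margin_mse(student_pos, student_neg, teacher_pos, teacher_neg):
    """Margin-MSE distillation over a batch of (positive, negative) pairs:
    the student (e.g. a small BERT reranker) learns to reproduce the
    teacher's (e.g. an LLM reranker's) score margin pos - neg.

    Illustrative sketch of a common technique, not the repo's exact loss.
    Each argument is a list of floats of equal length.
    """
    n = len(student_pos)
    total = 0.0
    for sp, sn, tp, tn in zip(student_pos, student_neg,
                              teacher_pos, teacher_neg):
        # Penalise the squared difference between the two score margins.
        total += ((sp - sn) - (tp - tn)) ** 2
    return total / n
```

Matching margins rather than raw scores makes the objective invariant to a constant offset between teacher and student score scales.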

Licensing & Compatibility

  • Licensed under the MIT License.
  • Permissive license suitable for commercial use and integration into closed-source projects.

Limitations & Caveats

The README notes that fine-tuning open-source models on existing general-domain datasets may yield only limited improvement, and suggests that domain-specific (vertical-field) datasets produce larger gains.

Health Check

  • Last commit: 4 weeks ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 1
  • Issues (30d): 5
  • Star History: 167 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Didier Lopes (Founder of OpenBB), and 11 more.

sentence-transformers by UKPLab

Framework for text embeddings, retrieval, and reranking

  • Top 0.2% · 17k stars
  • created 6 years ago · updated 3 days ago