Embedding fine-tuning for RAG
This repository demonstrates fine-tuning embedding models for Retrieval Augmented Generation (RAG) using synthetically generated data, targeting developers and researchers seeking to improve retrieval accuracy without labeled datasets. It provides a step-by-step guide to creating synthetic query-document pairs with an LLM, fine-tuning an open-source embedding model, and evaluating its performance.
How It Works
The core approach leverages an LLM to generate hypothetical questions answerable by specific text chunks, creating synthetic positive query-document pairs and bypassing the need for human labeling. A minimal sketch of this step follows.
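The generation step can be outlined as below. This is an illustrative sketch, not the repository's exact code: `complete` is a hypothetical stand-in for whatever LLM client you use, and the prompt wording is an assumption.

```python
# Sketch: generate synthetic (query -> chunk) pairs with an LLM.
# `complete` is a placeholder callable (prompt -> text reply), not part of this repo.
import uuid

PROMPT = (
    "Context:\n{chunk}\n\n"
    "Write {n} questions that can be answered using only the context above, "
    "one per line."
)

def generate_pairs(chunks: dict, complete, questions_per_chunk: int = 2):
    """Build synthetic positive query-document pairs without human labels.

    chunks: mapping of chunk_id -> chunk text.
    """
    queries, relevant_docs = {}, {}
    for chunk_id, chunk in chunks.items():
        reply = complete(PROMPT.format(chunk=chunk, n=questions_per_chunk))
        for question in (line.strip() for line in reply.splitlines()):
            if not question:
                continue
            qid = str(uuid.uuid4())
            queries[qid] = question
            relevant_docs[qid] = [chunk_id]  # the source chunk is the positive document
    return queries, relevant_docs
```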
The fine-tuning process uses the sentence-transformers library with MultipleNegativesRankingLoss and InformationRetrievalEvaluator, starting from the "BAAI/bge-small-en" model and training for a few epochs. Evaluation compares the fine-tuned model against both the base model and proprietary embedding models using retrieval metrics.
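In outline, the training and evaluation loop looks roughly like the following. This is a hedged sketch rather than the repo's exact script: `corpus`, `queries`, and `relevant_docs` are assumed to come from a generation step like the one above, and the hyperparameters are illustrative.

```python
# Sketch: fine-tune an embedding model on the synthetic pairs with the
# classic sentence-transformers .fit API.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses
from sentence_transformers.evaluation import InformationRetrievalEvaluator

model = SentenceTransformer("BAAI/bge-small-en")

# One positive (query, passage) pair per example; MultipleNegativesRankingLoss
# treats the other passages in the batch as negatives, so no explicit
# negative labels are needed.
train_examples = [
    InputExample(texts=[queries[qid], corpus[doc_ids[0]]])
    for qid, doc_ids in relevant_docs.items()
]
train_loader = DataLoader(train_examples, batch_size=16, shuffle=True)
train_loss = losses.MultipleNegativesRankingLoss(model)

# Retrieval-style evaluation (hit rate, MRR, NDCG, ...) over the corpus.
evaluator = InformationRetrievalEvaluator(
    queries, corpus, {qid: set(ids) for qid, ids in relevant_docs.items()}
)

model.fit(
    train_objectives=[(train_loader, train_loss)],
    evaluator=evaluator,
    epochs=2,          # "a few epochs", per the description above
    warmup_steps=10,
    output_path="bge-small-finetuned",
)
```

In practice the evaluator should be built from a held-out split of the synthetic pairs rather than the training pairs themselves, so the reported metrics reflect generalization.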
Quick Start & Requirements
Run pip install -r requirements.txt after cloning this repository and the llama_index repo.

Highlighted Details
Uses MultipleNegativesRankingLoss for training, which treats the other passages in each batch as negatives, so only positive pairs are required (a comparison sketch follows below).
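Under the same assumptions as the sketches above, the base-versus-fine-tuned comparison can be run by pointing the same evaluator at both checkpoints. Note that in sentence-transformers 2.x the evaluator call returns a single primary metric as a float; newer versions return a dict of metrics.

```python
# Sketch: compare retrieval quality of the base and fine-tuned models,
# reusing the `evaluator` defined in the fine-tuning sketch above.
from sentence_transformers import SentenceTransformer

for name in ("BAAI/bge-small-en", "bge-small-finetuned"):
    score = evaluator(SentenceTransformer(name))  # primary retrieval metric
    print(f"{name}: {score}")
```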
Maintenance & Community
This repository is outdated: its functionality has since been integrated into the core LlamaIndex repo.
Licensing & Compatibility
The repository's license is not explicitly stated in the README.
Limitations & Caveats
The project is marked as outdated; all of its embedding fine-tuning abstractions now live in the main LlamaIndex repository, and users are directed to the LlamaIndex documentation for current guidance.