finetune-embedding by run-llama

Embedding finetuning for RAG research paper

created 1 year ago
504 stars

Top 62.6% on sourcepulse

Project Summary

This repository demonstrates fine-tuning embedding models for Retrieval Augmented Generation (RAG) using synthetically generated data, targeting developers and researchers seeking to improve retrieval accuracy without labeled datasets. It provides a step-by-step guide to creating synthetic query-document pairs with an LLM, fine-tuning an open-source embedding model, and evaluating its performance.
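
As a concrete illustration of the pipeline's first step, here is a minimal sketch of synthetic pair generation, assuming the OpenAI chat API; the prompt wording, model choice, and sample chunks are illustrative placeholders, not the repository's exact code.

```python
# Minimal sketch of synthetic query generation (illustrative, not the
# repository's exact code). Assumes OPENAI_API_KEY is set.
from openai import OpenAI

client = OpenAI()

chunks = [
    "Self-attention computes weighted interactions between all tokens ...",
    "Retrieval-augmented generation combines a retriever with a generator ...",
]

pairs = []  # (synthetic query, source chunk) training pairs
for chunk in chunks:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # assumed model choice for illustration
        messages=[{
            "role": "user",
            "content": (
                "Generate one question that can be answered using "
                f"only the following context:\n\n{chunk}"
            ),
        }],
    )
    question = response.choices[0].message.content.strip()
    pairs.append((question, chunk))
```

Because each generated question is answerable by its source chunk, every pair is a known positive, which is exactly what the training loss below requires.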

How It Works

The core approach uses an LLM to generate hypothetical questions that a given text chunk can answer, yielding synthetic positive query-document pairs and bypassing the need for human labeling. Fine-tuning is done with the sentence-transformers library: the "BAAI/bge-small-en" model is trained for a few epochs with MultipleNegativesRankingLoss, while an InformationRetrievalEvaluator tracks retrieval quality during training. Finally, the fine-tuned model is compared against the base model and proprietary embedding models on retrieval metrics.
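
A minimal sketch of that fine-tuning loop with sentence-transformers' classic fit() API follows; the batch size, epoch count, warmup steps, and the inline `pairs` sample are illustrative assumptions.

```python
# Minimal fine-tuning sketch (hyperparameters are illustrative).
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# (query, chunk) pairs; in the real workflow these come from the
# synthetic-generation step sketched earlier.
pairs = [
    ("What does self-attention compute?",
     "Self-attention computes weighted interactions between all tokens ..."),
]

model = SentenceTransformer("BAAI/bge-small-en")

# Each InputExample holds one positive (query, passage) pair; under
# MultipleNegativesRankingLoss, the other passages in each batch act
# as in-batch negatives, so no explicit negatives are needed.
train_examples = [InputExample(texts=[q, chunk]) for q, chunk in pairs]
train_loader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(train_loader, train_loss)],
    epochs=2,  # the README trains for a few epochs
    warmup_steps=10,
)
model.save("bge-small-en-finetuned")
```

The in-batch-negatives design is why the synthetic data only needs positive pairs: every other passage in the batch serves as a negative for free.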

Quick Start & Requirements

The workflow runs end to end in three steps: generate synthetic query-document pairs with an LLM, fine-tune the embedding model, and evaluate the result. In practice this requires access to an LLM API for data generation and the sentence-transformers library for fine-tuning and evaluation.
Highlighted Details

  • Fine-tuning can substantially improve retrieval performance.
  • Uses synthetic data generation via LLM to create training pairs.
  • Employs MultipleNegativesRankingLoss for training.
  • Evaluates against base and OpenAI embedding models (see the evaluation sketch after this list).
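
For the evaluation step, here is a minimal sketch of sentence-transformers' InformationRetrievalEvaluator; the toy queries, corpus, IDs, and model path are assumptions for illustration.

```python
# Minimal retrieval-evaluation sketch (toy data is illustrative).
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import InformationRetrievalEvaluator

model = SentenceTransformer("bge-small-en-finetuned")  # path from the sketch above

queries = {"q1": "What does self-attention compute?"}                  # query_id -> query
corpus = {"d1": "Self-attention computes weighted interactions ..."}   # doc_id -> passage
relevant_docs = {"q1": {"d1"}}                                         # query_id -> relevant doc_ids

evaluator = InformationRetrievalEvaluator(queries, corpus, relevant_docs)
score = evaluator(model)  # embeds queries and corpus, reports metrics such as MRR and NDCG
print(score)
```

The same evaluator can also be passed to model.fit(evaluator=...) so retrieval quality is tracked during training, matching the InformationRetrievalEvaluator usage described above.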

Maintenance & Community

This repository is no longer actively maintained as a standalone project; its functionality has been integrated into the core LlamaIndex repository.

Licensing & Compatibility

The repository's license is not explicitly stated in the README.

Limitations & Caveats

The project is marked as outdated: all of its embedding fine-tuning abstractions now live in the core LlamaIndex repository, and users are directed to the LlamaIndex documentation for current guidance.

Health Check

Last commit: 1 year ago
Responsiveness: Inactive
Pull Requests (30d): 0
Issues (30d): 0

Star History

11 stars in the last 90 days
