finetune-embedding by run-llama

Embedding finetuning for RAG research paper

Created 2 years ago
511 stars

Top 61.2% on SourcePulse

Project Summary

This repository demonstrates fine-tuning embedding models for Retrieval Augmented Generation (RAG) using synthetically generated data, targeting developers and researchers seeking to improve retrieval accuracy without labeled datasets. It provides a step-by-step guide to creating synthetic query-document pairs with an LLM, fine-tuning an open-source embedding model, and evaluating its performance.

How It Works

The core approach uses an LLM to generate hypothetical questions that a given text chunk can answer, which yields synthetic positive query-document pairs without human labeling. Fine-tuning uses the sentence-transformers library: the "BAAI/bge-small-en" model is trained for a few epochs with MultipleNegativesRankingLoss as the training objective and InformationRetrievalEvaluator for evaluation. The fine-tuned model is then compared against the base model and a proprietary model (OpenAI embeddings) using retrieval metrics.
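
The pipeline described above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: `build_synthetic_pairs`, `placeholder_question_generator`, and `finetune` are hypothetical names, the question generator is a stand-in for a real LLM call, and the hyperparameters (batch size, epoch count, warmup ratio, evaluation interval) are assumed values. The fine-tuning function assumes the classic `model.fit` API of sentence-transformers and is only defined here, not run, because it downloads a model and needs the library installed.

```python
from typing import Callable, Dict, List, Set, Tuple


def build_synthetic_pairs(
    chunks: Dict[str, str],
    generate_questions: Callable[[str, int], List[str]],
    questions_per_chunk: int = 2,
) -> Tuple[Dict[str, str], Dict[str, Set[str]]]:
    """Create query -> relevant-chunk mappings from unlabeled text chunks."""
    queries: Dict[str, str] = {}
    relevant_docs: Dict[str, Set[str]] = {}
    for chunk_id, text in chunks.items():
        for i, question in enumerate(generate_questions(text, questions_per_chunk)):
            query_id = f"{chunk_id}-q{i}"
            queries[query_id] = question
            # The chunk the question was generated from is its only positive.
            relevant_docs[query_id] = {chunk_id}
    return queries, relevant_docs


def placeholder_question_generator(text: str, n: int) -> List[str]:
    """Hypothetical stand-in for an LLM call that writes n questions about text."""
    return [f"Question {i + 1} about: {text[:40]}" for i in range(n)]


def finetune(
    queries: Dict[str, str],
    corpus: Dict[str, str],
    relevant_docs: Dict[str, Set[str]],
    output_path: str = "finetuned-bge-small-en",  # assumed output location
    epochs: int = 2,  # assumed; the source only says "a few epochs"
    batch_size: int = 10,  # assumed
):
    """Sketch of fine-tuning; for brevity it evaluates on the training pairs."""
    # Imported lazily so the rest of this example runs without these packages.
    from sentence_transformers import InputExample, SentenceTransformer, losses
    from sentence_transformers.evaluation import InformationRetrievalEvaluator
    from torch.utils.data import DataLoader

    model = SentenceTransformer("BAAI/bge-small-en")
    examples = [
        InputExample(texts=[query, corpus[next(iter(relevant_docs[query_id]))]])
        for query_id, query in queries.items()
    ]
    loader = DataLoader(examples, shuffle=True, batch_size=batch_size)
    loss = losses.MultipleNegativesRankingLoss(model)
    evaluator = InformationRetrievalEvaluator(queries, corpus, relevant_docs)
    model.fit(
        train_objectives=[(loader, loss)],
        epochs=epochs,
        warmup_steps=int(len(loader) * epochs * 0.1),
        evaluator=evaluator,
        evaluation_steps=50,
        output_path=output_path,
    )
    return model


# Build a tiny synthetic dataset from two example chunks.
chunks = {
    "c1": "Embedding models map text to dense vectors.",
    "c2": "Retrieval quality depends on the embedding space.",
}
queries, relevant_docs = build_synthetic_pairs(chunks, placeholder_question_generator)
```

In practice a held-out set of chunks would be used to build a separate validation dataset for the evaluator, rather than reusing the training pairs.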

Quick Start & Requirements

Highlighted Details

  • Fine-tuning can substantially improve retrieval performance.
  • Uses synthetic data generation via LLM to create training pairs.
  • Employs MultipleNegativesRankingLoss for training.
  • Evaluates against base and OpenAI embedding models.
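
To make the training objective concrete, here is a plain-Python sketch of the idea behind a multiple negatives ranking loss: within a batch, each query's paired document is the positive and every other document in the batch serves as a negative. The function names are hypothetical, and the cosine similarity with a scale factor of 20 is an assumption chosen to mirror common defaults; this is an illustration of the technique, not the library's implementation.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def in_batch_ranking_loss(
    query_vecs: List[Sequence[float]],
    doc_vecs: List[Sequence[float]],
    scale: float = 20.0,  # assumed similarity scale
) -> float:
    """Mean cross-entropy where doc_vecs[i] is the positive for query_vecs[i]."""
    if not query_vecs or len(query_vecs) != len(doc_vecs):
        raise ValueError("need equally sized, non-empty batches")
    total = 0.0
    for i, query in enumerate(query_vecs):
        logits = [scale * cosine(query, doc) for doc in doc_vecs]
        # Numerically stable log-sum-exp over all documents in the batch.
        peak = max(logits)
        log_sum_exp = peak + math.log(sum(math.exp(v - peak) for v in logits))
        # Negative log-probability of picking the paired document.
        total += log_sum_exp - logits[i]
    return total / len(query_vecs)


# Queries aligned with their own documents give a small loss;
# pairing them with the wrong documents gives a large one.
aligned = in_batch_ranking_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = in_batch_ranking_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

Because the negatives come for free from the rest of the batch, this kind of objective only needs positive pairs, which is why it fits synthetic query-document data.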

Maintenance & Community

This repository is technically outdated, with its functionality now integrated into the core LlamaIndex repo.

Licensing & Compatibility

The repository's license is not explicitly stated in the README.

Limitations & Caveats

The project is marked as technically outdated, with all embedding fine-tuning abstractions now present in the LlamaIndex repository. Users are directed to the LlamaIndex documentation for current guidance.

Health Check

  • Last commit: 2 years ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 2 stars in the last 30 days
