Embedding fine-tuning for RAG
This repository demonstrates fine-tuning embedding models for Retrieval Augmented Generation (RAG) using synthetically generated data, targeting developers and researchers seeking to improve retrieval accuracy without labeled datasets. It provides a step-by-step guide to creating synthetic query-document pairs with an LLM, fine-tuning an open-source embedding model, and evaluating its performance.
How It Works
The core approach leverages an LLM to generate hypothetical questions answerable by specific text chunks, creating synthetic positive query-document pairs and bypassing the need for human labeling. A minimal sketch of this step follows.
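The generation step can be outlined as below. This is an illustrative sketch, not the repository's exact code: `complete` is a hypothetical stand-in for whatever LLM client you use, and the prompt wording is an assumption.

```python
# Sketch: generate synthetic (query -> chunk) pairs with an LLM.
# `complete` is a placeholder callable (prompt -> text reply), not part of this repo.
import uuid

PROMPT = (
    "Context:\n{chunk}\n\n"
    "Write {n} questions that can be answered using only the context above, "
    "one per line."
)

def generate_pairs(chunks: dict, complete, questions_per_chunk: int = 2):
    """Build synthetic positive query-document pairs without human labels.

    chunks: mapping of chunk_id -> chunk text.
    """
    queries, relevant_docs = {}, {}
    for chunk_id, chunk in chunks.items():
        reply = complete(PROMPT.format(chunk=chunk, n=questions_per_chunk))
        for question in (line.strip() for line in reply.splitlines()):
            if not question:
                continue
            qid = str(uuid.uuid4())
            queries[qid] = question
            relevant_docs[qid] = [chunk_id]  # the source chunk is the positive document
    return queries, relevant_docs
```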
The fine-tuning process uses the sentence-transformers library with MultipleNegativesRankingLoss and InformationRetrievalEvaluator, starting from the "BAAI/bge-small-en" model and training for a few epochs. Evaluation compares the fine-tuned model against both the base model and proprietary embedding models using retrieval metrics.
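In outline, the training and evaluation loop looks roughly like the following. This is a hedged sketch rather than the repo's exact script: `corpus`, `queries`, and `relevant_docs` are assumed to come from a generation step like the one above, and the hyperparameters are illustrative.

```python
# Sketch: fine-tune an embedding model on the synthetic pairs with the
# classic sentence-transformers .fit API.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses
from sentence_transformers.evaluation import InformationRetrievalEvaluator

model = SentenceTransformer("BAAI/bge-small-en")

# One positive (query, passage) pair per example; MultipleNegativesRankingLoss
# treats the other passages in the batch as negatives, so no explicit
# negative labels are needed.
train_examples = [
    InputExample(texts=[queries[qid], corpus[doc_ids[0]]])
    for qid, doc_ids in relevant_docs.items()
]
train_loader = DataLoader(train_examples, batch_size=16, shuffle=True)
train_loss = losses.MultipleNegativesRankingLoss(model)

# Retrieval-style evaluation (hit rate, MRR, NDCG, ...) over the corpus.
evaluator = InformationRetrievalEvaluator(
    queries, corpus, {qid: set(ids) for qid, ids in relevant_docs.items()}
)

model.fit(
    train_objectives=[(train_loader, train_loss)],
    evaluator=evaluator,
    epochs=2,          # "a few epochs", per the description above
    warmup_steps=10,
    output_path="bge-small-finetuned",
)
```

In practice the evaluator should be built from a held-out split of the synthetic pairs rather than the training pairs themselves, so the reported metrics reflect generalization.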
Quick Start & Requirements
Run pip install -r requirements.txt after cloning this repository and the llama_index repo.

Highlighted Details
Uses MultipleNegativesRankingLoss for training, which treats the other passages in each batch as negatives, so only positive pairs are required (a comparison sketch follows below).
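Under the same assumptions as the sketches above, the base-versus-fine-tuned comparison can be run by pointing the same evaluator at both checkpoints. Note that in sentence-transformers 2.x the evaluator call returns a single primary metric as a float; newer versions return a dict of metrics.

```python
# Sketch: compare retrieval quality of the base and fine-tuned models,
# reusing the `evaluator` defined in the fine-tuning sketch above.
from sentence_transformers import SentenceTransformer

for name in ("BAAI/bge-small-en", "bge-small-finetuned"):
    score = evaluator(SentenceTransformer(name))  # primary retrieval metric
    print(f"{name}: {score}")
```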
Maintenance & Community
This repository is outdated: its functionality has since been integrated into the core LlamaIndex repo.
Licensing & Compatibility
The repository's license is not explicitly stated in the README.
Limitations & Caveats
The project is marked as outdated; all of its embedding fine-tuning abstractions now live in the main LlamaIndex repository, and users are directed to the LlamaIndex documentation for current guidance.