llm2vec by McGill-NLP

Text encoder recipe using decoder-only LLMs

created 1 year ago
1,564 stars

Top 27.2% on sourcepulse

Project Summary

This repository provides LLM2Vec, a method to transform decoder-only Large Language Models (LLMs) into powerful text encoders. It targets researchers and practitioners seeking to leverage LLMs for tasks like semantic search, classification, and clustering, offering state-of-the-art performance with a straightforward recipe.

How It Works

LLM2Vec turns a decoder-only LLM into a text encoder through a three-step recipe: (1) enabling bidirectional attention, (2) adapting the model with masked next token prediction (MNTP), and (3) unsupervised contrastive learning (SimCSE). The first two steps let an otherwise causal, left-to-right model capture bidirectional context, which is crucial for effective text representations; the contrastive step then tunes those representations for similarity tasks, where LLM2Vec models outperform traditional encoder-only approaches.
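The third step follows the unsupervised SimCSE idea: each sentence is encoded twice under different dropout masks, and the two views are pulled together against the other sentences in the batch. Below is a minimal PyTorch sketch of that objective, illustrative only and not the repository's actual training code:

    import torch
    import torch.nn.functional as F

    def simcse_loss(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.05) -> torch.Tensor:
        # z1, z2: (batch, dim) embeddings of the same sentences under two dropout masks.
        z1 = F.normalize(z1, dim=-1)
        z2 = F.normalize(z2, dim=-1)
        sim = z1 @ z2.T / temperature                        # (batch, batch) cosine similarities
        labels = torch.arange(z1.size(0), device=z1.device)  # positive = the i-th sentence's other view
        return F.cross_entropy(sim, labels)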

Quick Start & Requirements

  • Install via pip: pip install llm2vec and pip install flash-attn --no-build-isolation.
  • Requires Python and PyTorch. GPU with CUDA is recommended for performance.
  • Official HuggingFace models and documentation are linked within the README; see the usage sketch after this list.
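
A minimal usage sketch, adapted from the pattern shown in the project README; the model identifiers and keyword arguments below are illustrative, so check the README for the exact API and checkpoint names:

    import torch
    from llm2vec import LLM2Vec

    # Load a base model plus a PEFT/LoRA checkpoint released by the authors
    # (model names are examples; see the README for the full list).
    l2v = LLM2Vec.from_pretrained(
        "McGill-NLP/LLM2Vec-Mistral-7B-Instruct-v2-mntp",
        peft_model_name_or_path="McGill-NLP/LLM2Vec-Mistral-7B-Instruct-v2-mntp-supervised",
        device_map="cuda" if torch.cuda.is_available() else "cpu",
        torch_dtype=torch.bfloat16,
    )

    # Queries are passed as [instruction, text] pairs; documents need no instruction.
    instruction = "Given a web search query, retrieve relevant passages that answer the query:"
    queries = [[instruction, "how much protein should a female eat"]]
    documents = ["Protein needs depend on body weight and activity level."]

    q_reps = l2v.encode(queries)    # (num_queries, hidden_dim) tensor
    d_reps = l2v.encode(documents)  # (num_documents, hidden_dim) tensor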

Highlighted Details

  • Supports Llama-2, Llama-3, Llama-3.1, Llama-3.2, Mistral, Gemma, and Qwen-2 models.
  • Achieves state-of-the-art performance on the MTEB benchmark among models trained on public data; a retrieval-scoring sketch follows this list.
  • Offers pre-trained checkpoints for supervised and unsupervised variants on HuggingFace.
  • Includes scripts for MNTP, SimCSE, supervised contrastive training, and word-level task fine-tuning.
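
Following on from the quick-start sketch, retrieval-style scoring ranks documents by cosine similarity between encoded query and document vectors. The sketch below uses placeholder tensors standing in for l2v.encode(...) outputs:

    import torch
    import torch.nn.functional as F

    # Placeholder embeddings; in practice these come from l2v.encode(...).
    q_reps = torch.randn(2, 4096)    # (num_queries, hidden_dim)
    d_reps = torch.randn(10, 4096)   # (num_documents, hidden_dim)

    # Rank documents for each query by cosine similarity.
    scores = F.normalize(q_reps, dim=-1) @ F.normalize(d_reps, dim=-1).T
    best = scores.argmax(dim=-1)     # index of the top-scoring document per query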

Maintenance & Community

The project is actively updated, with recent additions including support for new models and evaluation scripts. Questions and issues can be raised on the GitHub repository.

Licensing & Compatibility

The repository's license is not explicitly stated in the README. The released checkpoints build on third-party base models such as Llama, Mistral, Gemma, and Qwen, so suitability for commercial use depends on the licenses of those underlying LLMs.

Limitations & Caveats

The README mentions the need for flash-attention for optimal performance, which may require specific hardware and installation steps. Training custom models requires significant computational resources and datasets.

Health Check

  • Last commit: 6 months ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star history: 72 stars in the last 90 days

