llm2vec by McGill-NLP

Text encoder recipe using decoder-only LLMs

created 1 year ago
1,564 stars

Top 27.2% on sourcepulse

Project Summary

This repository provides LLM2Vec, a method to transform decoder-only Large Language Models (LLMs) into powerful text encoders. It targets researchers and practitioners seeking to leverage LLMs for tasks like semantic search, classification, and clustering, offering state-of-the-art performance with a straightforward recipe.

How It Works

LLM2Vec turns a decoder-only LLM into a text encoder through a three-step recipe: (1) enabling bidirectional attention, (2) adapting the model with masked next token prediction (MNTP), and (3) unsupervised contrastive learning (SimCSE). The first two steps let an otherwise causal, left-to-right model capture bidirectional context, which is crucial for effective text representations; the contrastive step then tunes those representations for similarity tasks, where LLM2Vec models outperform traditional encoder-only approaches.
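The third step follows the unsupervised SimCSE idea: each sentence is encoded twice under different dropout masks, and the two views are pulled together against the other sentences in the batch. Below is a minimal PyTorch sketch of that objective, illustrative only and not the repository's actual training code:

    import torch
    import torch.nn.functional as F

    def simcse_loss(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.05) -> torch.Tensor:
        # z1, z2: (batch, dim) embeddings of the same sentences under two dropout masks.
        z1 = F.normalize(z1, dim=-1)
        z2 = F.normalize(z2, dim=-1)
        sim = z1 @ z2.T / temperature                        # (batch, batch) cosine similarities
        labels = torch.arange(z1.size(0), device=z1.device)  # positive = the i-th sentence's other view
        return F.cross_entropy(sim, labels)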

Quick Start & Requirements

  • Install via pip: pip install llm2vec and pip install flash-attn --no-build-isolation.
  • Requires Python and PyTorch. GPU with CUDA is recommended for performance.
  • Official HuggingFace models and documentation are linked within the README; see the usage sketch after this list.
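
A minimal usage sketch, adapted from the pattern shown in the project README; the model identifiers and keyword arguments below are illustrative, so check the README for the exact API and checkpoint names:

    import torch
    from llm2vec import LLM2Vec

    # Load a base model plus a PEFT/LoRA checkpoint released by the authors
    # (model names are examples; see the README for the full list).
    l2v = LLM2Vec.from_pretrained(
        "McGill-NLP/LLM2Vec-Mistral-7B-Instruct-v2-mntp",
        peft_model_name_or_path="McGill-NLP/LLM2Vec-Mistral-7B-Instruct-v2-mntp-supervised",
        device_map="cuda" if torch.cuda.is_available() else "cpu",
        torch_dtype=torch.bfloat16,
    )

    # Queries are passed as [instruction, text] pairs; documents need no instruction.
    instruction = "Given a web search query, retrieve relevant passages that answer the query:"
    queries = [[instruction, "how much protein should a female eat"]]
    documents = ["Protein needs depend on body weight and activity level."]

    q_reps = l2v.encode(queries)    # (num_queries, hidden_dim) tensor
    d_reps = l2v.encode(documents)  # (num_documents, hidden_dim) tensor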

Highlighted Details

  • Supports Llama-2, Llama-3, Llama-3.1, Llama-3.2, Mistral, Gemma, and Qwen-2 models.
  • Achieves state-of-the-art performance on the MTEB benchmark among models trained on public data; a retrieval-scoring sketch follows this list.
  • Offers pre-trained checkpoints for supervised and unsupervised variants on HuggingFace.
  • Includes scripts for MNTP, SimCSE, supervised contrastive training, and word-level task fine-tuning.
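
Following on from the quick-start sketch, retrieval-style scoring ranks documents by cosine similarity between encoded query and document vectors. The sketch below uses placeholder tensors standing in for l2v.encode(...) outputs:

    import torch
    import torch.nn.functional as F

    # Placeholder embeddings; in practice these come from l2v.encode(...).
    q_reps = torch.randn(2, 4096)    # (num_queries, hidden_dim)
    d_reps = torch.randn(10, 4096)   # (num_documents, hidden_dim)

    # Rank documents for each query by cosine similarity.
    scores = F.normalize(q_reps, dim=-1) @ F.normalize(d_reps, dim=-1).T
    best = scores.argmax(dim=-1)     # index of the top-scoring document per query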

Maintenance & Community

The project is actively updated, with recent additions including support for new models and evaluation scripts. Questions and issues can be raised on the GitHub repository.

Licensing & Compatibility

The repository's license is not explicitly stated in the README. The released checkpoints build on third-party base models such as Llama, Mistral, Gemma, and Qwen, so suitability for commercial use depends on the licenses of those underlying LLMs.

Limitations & Caveats

The README mentions the need for flash-attention for optimal performance, which may require specific hardware and installation steps. Training custom models requires significant computational resources and datasets.

Health Check

  • Last commit: 6 months ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star history: 72 stars in the last 90 days

