llm2vec by McGill-NLP

Text encoder recipe using decoder-only LLMs

Created 1 year ago
1,595 stars

Top 26.2% on SourcePulse

Project Summary

This repository provides LLM2Vec, a method to transform decoder-only Large Language Models (LLMs) into powerful text encoders. It targets researchers and practitioners seeking to leverage LLMs for tasks like semantic search, classification, and clustering, offering state-of-the-art performance with a straightforward recipe.

How It Works

LLM2Vec turns a decoder-only LLM into a text encoder in three steps: (1) enabling bidirectional attention so every token can attend to the full sequence, (2) training with masked next token prediction (MNTP) to adapt the model to its new attention pattern, and (3) unsupervised contrastive learning (SimCSE). This lets a standard LLM capture bidirectional context, which is crucial for effective text representation, and then fine-tunes it for similarity tasks, outperforming traditional encoder-only methods.
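Conceptually, the first step just swaps the decoder's causal attention mask for an all-ones mask, and sequence embeddings are then obtained by pooling token states. A minimal NumPy sketch of that idea (mask shapes and mean pooling here are illustrative, not the library's internals):

```python
import numpy as np

def causal_mask(seq_len: int) -> np.ndarray:
    """Decoder-style mask: position i may attend only to positions <= i."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def bidirectional_mask(seq_len: int) -> np.ndarray:
    """LLM2Vec step 1: drop causality so every token sees every token."""
    return np.ones((seq_len, seq_len), dtype=bool)

def mean_pool(token_states: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token states over non-padding positions to get one vector."""
    mask = attention_mask[:, None].astype(token_states.dtype)  # (seq, 1)
    return (token_states * mask).sum(axis=0) / mask.sum()

# A causal mask hides future tokens; the bidirectional one hides nothing.
print(causal_mask(3).sum())         # 6 visible token pairs
print(bidirectional_mask(3).sum())  # 9 visible token pairs
```

The MNTP stage exists precisely because a pretrained decoder has never seen this mask; training with a masked-token objective teaches it to actually use the newly visible context.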

Quick Start & Requirements

  • Install via pip: pip install llm2vec and pip install flash-attn --no-build-isolation.
  • Requires Python and PyTorch. GPU with CUDA is recommended for performance.
  • Official HuggingFace models and documentation are linked within the README.
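After installation, encoding text follows the pattern documented in the README: load a base MNTP checkpoint plus a PEFT adapter, then call encode. A hedged sketch (the checkpoint names below are the McGill-NLP models published on HuggingFace; the cosine helper is our own illustration, not part of the library):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Score two embedding vectors; 1.0 means identical direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def embed(texts):
    """Load an LLM2Vec checkpoint and encode texts. Downloads ~8B weights;
    a CUDA GPU is strongly recommended."""
    import torch
    from llm2vec import LLM2Vec

    l2v = LLM2Vec.from_pretrained(
        "McGill-NLP/LLM2Vec-Meta-Llama-3-8B-Instruct-mntp",
        peft_model_name_or_path="McGill-NLP/LLM2Vec-Meta-Llama-3-8B-Instruct-mntp-supervised",
        device_map="cuda" if torch.cuda.is_available() else "cpu",
        torch_dtype=torch.bfloat16,
    )
    return l2v.encode(texts).float().numpy()

# Example (requires GPU and downloaded weights):
# docs = embed(["LLM2Vec turns decoders into encoders.", "Bananas are yellow."])
# print(cosine_similarity(docs[0], docs[1]))
```

Embeddings from the supervised variant are intended for retrieval and similarity tasks, so cosine similarity between encoded texts is the typical downstream scoring function.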

Highlighted Details

  • Supports Llama 2, Llama 3/3.1/3.2, Mistral, Gemma, and Qwen-2 models.
  • Achieves state-of-the-art performance on the MTEB benchmark for models trained on public data.
  • Offers pre-trained checkpoints for supervised and unsupervised variants on HuggingFace.
  • Includes scripts for MNTP, SimCSE, supervised contrastive training, and word-level task fine-tuning.

Maintenance & Community

The project is actively updated, with recent additions including support for new models and evaluation scripts. Questions and issues can be raised on the GitHub repository.

Licensing & Compatibility

The repository's license is not explicitly stated in the README. Commercial use ultimately depends on the licenses of the underlying LLM checkpoints (for example, Llama models are distributed under Meta's community license rather than a standard permissive one), so each base model's terms should be checked separately.

Limitations & Caveats

The README mentions the need for flash-attention for optimal performance, which may require specific hardware and installation steps. Training custom models requires significant computational resources and datasets.

Health Check

  • Last Commit: 8 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 6 stars in the last 30 days

Explore Similar Projects

Starred by Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), and 8 more.

EAGLE by SafeAILab

Top 1.3% · 2k stars
Speculative decoding research paper for faster LLM inference
Created 1 year ago
Updated 2 days ago
Starred by Eric Zhu (Coauthor of AutoGen; Research Scientist at Microsoft Research), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 1 more.

textgrad by zou-group

Top 0.7% · 3k stars
Autograd engine for textual gradients, enabling LLM-driven optimization
Created 1 year ago
Updated 2 months ago