Text encoder recipe using decoder-only LLMs
This repository provides LLM2Vec, a method to transform decoder-only Large Language Models (LLMs) into powerful text encoders. It targets researchers and practitioners seeking to leverage LLMs for tasks like semantic search, classification, and clustering, offering state-of-the-art performance with a straightforward recipe.
How It Works
LLM2Vec converts decoder-only LLMs into text encoders through a three-step recipe: enabling bidirectional attention, training with masked next token prediction (MNTP), and applying unsupervised contrastive learning (SimCSE). The first step lets a standard LLM capture bidirectional context, which is crucial for effective text representation; the latter two fine-tune it for similarity tasks, where it outperforms traditional encoder-based methods.
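To make the first step concrete, here is a minimal illustrative sketch (assuming PyTorch; it mirrors the idea, not the repository's actual implementation) of how the causal mask of a decoder-only model differs from the bidirectional mask LLM2Vec enables:

import torch

# Conceptual sketch only. A decoder-only LLM restricts each token to past
# context via a causal (lower-triangular) mask; LLM2Vec's first step lifts
# that restriction so every token can attend to the full sequence.
seq_len = 4
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
bidirectional_mask = torch.ones(seq_len, seq_len, dtype=torch.bool)
print(causal_mask)         # lower-triangular: token i sees tokens 0..i
print(bidirectional_mask)  # all-ones: token i sees the whole sequence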
Quick Start & Requirements
Install the package:
pip install llm2vec
For flash-attention support, additionally run:
pip install flash-attn --no-build-isolation
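A minimal end-to-end sketch, following the usage pattern shown in the LLM2Vec README; the checkpoint names below are the Mistral variants published by the authors and serve as examples, so substitute the model appropriate for your setup:

import torch
from llm2vec import LLM2Vec

# Load a base checkpoint plus the LLM2Vec adapters (names are illustrative;
# pick the checkpoints published by the repository for your base model).
l2v = LLM2Vec.from_pretrained(
    "McGill-NLP/LLM2Vec-Mistral-7B-Instruct-v2-mntp",
    peft_model_name_or_path="McGill-NLP/LLM2Vec-Mistral-7B-Instruct-v2-mntp-unsup-simcse",
    device_map="cuda" if torch.cuda.is_available() else "cpu",
    torch_dtype=torch.bfloat16,
)

# Encode sentences into dense vectors; the result is a tensor of embeddings
# that can be compared with cosine similarity for search or clustering.
embeddings = l2v.encode([
    "LLM2Vec turns decoder-only LLMs into text encoders.",
    "Semantic search needs good text embeddings.",
])
print(embeddings.shape)  # (2, hidden_size)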
Highlighted Details
Maintenance & Community
The project is actively updated, with recent additions including support for new models and evaluation scripts. Questions and issues can be raised on the GitHub repository.
Licensing & Compatibility
The repository's license is not explicitly stated in the README, but it relies on Hugging Face models, which typically carry permissive licenses; suitability for commercial use depends on the licenses of the underlying LLMs.
Limitations & Caveats
The README notes that flash-attention is required for optimal performance, which may call for specific hardware and extra installation steps. Training custom models requires significant computational resources and datasets.