Research paper for multimodal embeddings using vision-language models
VLM2Vec provides a framework for training unified vision-language models (VLMs) capable of generating high-quality multimodal embeddings for diverse tasks. It targets researchers and practitioners seeking a single, robust model for various downstream applications, offering state-of-the-art performance on benchmarks like MMEB.
How It Works
VLM2Vec converts existing well-trained VLMs into embedding models by taking the hidden state of the last token in the sequence as the multimodal representation. This leverages the strength of pre-trained VLMs and is compatible with any open-source VLM backbone. By training on MMEB, a diverse dataset spanning multiple modalities, tasks, and instructions, VLM2Vec achieves robust, generalizable universal embeddings.
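Below is a minimal sketch of this last-token pooling idea, assuming a generic LLaVA-style backbone from Hugging Face transformers; the model id, prompt template, and pooling details are illustrative and may differ from the exact VLM2Vec implementation.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

# Illustrative backbone; VLM2Vec itself supports multiple open-source VLMs.
model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id)
model.eval()

image = Image.open("example.jpg")
prompt = "USER: <image>\nRepresent the given image for retrieval. ASSISTANT:"
inputs = processor(images=image, text=prompt, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# Use the hidden state of the final token as the multimodal embedding
# (assumes a single, unpadded sequence; with batching, index the last
# non-padded position per example instead).
last_hidden = outputs.hidden_states[-1]   # (batch, seq_len, hidden_dim)
embedding = last_hidden[:, -1, :]         # last-token pooling
embedding = torch.nn.functional.normalize(embedding, dim=-1)
```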
Quick Start & Requirements
Clone the repository with git clone and install the Python dependencies: transformers, datasets, accelerate, peft, bitsandbytes, gradio, einops, sentencepiece, xformers, and flash-attn. Training is launched with torchrun, and LoRA tuning is supported (see the sketch below).
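The following sketch shows how LoRA adapters can be attached to a VLM backbone with peft, since the repository supports LoRA tuning; the checkpoint name and target modules are placeholders, not the repository's actual configuration.

```python
from transformers import LlavaForConditionalGeneration
from peft import LoraConfig, get_peft_model

# Illustrative backbone; swap in whichever VLM you fine-tune.
base = LlavaForConditionalGeneration.from_pretrained("llava-hf/llava-1.5-7b-hf")

# Attach LoRA adapters to the attention projections (illustrative choice).
lora_cfg = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # only the LoRA adapter weights are trainable
```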
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats