VLM2Vec by TIGER-AI-Lab

Research framework for multimodal embeddings using vision-language models

Created 9 months ago · 345 stars · Top 81.4% on sourcepulse

Project Summary

VLM2Vec provides a framework for training unified vision-language models (VLMs) capable of generating high-quality multimodal embeddings for diverse tasks. It targets researchers and practitioners seeking a single, robust model for various downstream applications, offering state-of-the-art performance on benchmarks like MMEB.

How It Works

VLM2Vec converts existing, well-trained VLMs into embedding models by using the last token of the sequence as the multimodal representation. This approach leverages the strengths of pre-trained VLMs and is compatible with any open-source VLM backbone. By training on a diverse dataset (MMEB) spanning many modalities, tasks, and instructions, VLM2Vec achieves the robustness and generalization needed for universal embedding generation.
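
As a rough illustration of the last-token pooling described above (a minimal sketch assuming a generic Hugging Face-style model that returns per-token hidden states; this is not the repository's actual code):

```python
# Minimal sketch of last-token pooling: take the hidden state of the final
# non-padded token as the multimodal embedding. Illustrative only; names and
# shapes are assumptions, not VLM2Vec's actual implementation.
import torch
import torch.nn.functional as F

def last_token_embedding(hidden_states: torch.Tensor,
                         attention_mask: torch.Tensor) -> torch.Tensor:
    """hidden_states: (batch, seq_len, dim); attention_mask: (batch, seq_len)."""
    # Position of the last real (non-padding) token in each sequence.
    last_idx = attention_mask.sum(dim=1) - 1              # (batch,)
    batch_idx = torch.arange(hidden_states.size(0), device=hidden_states.device)
    emb = hidden_states[batch_idx, last_idx]              # (batch, dim)
    return F.normalize(emb, dim=-1)                       # unit-length embeddings
```

Query and target embeddings pooled this way can then be scored with cosine similarity for retrieval, classification, and the other MMEB task types.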

Quick Start & Requirements

  • Install: git clone the repository.
  • Data: Download the MMEB-train and MMEB-eval datasets from Hugging Face (see the download sketch after this list).
  • Training: Launched with torchrun; depends on transformers, datasets, accelerate, peft, bitsandbytes, gradio, einops, sentencepiece, xformers, and flash-attn. Supports LoRA tuning.
  • Inference/Evaluation: Also launched with torchrun and uses the same dependency stack.
  • Hardware: GPUs with ample memory are recommended; GradCache can reduce memory pressure on smaller GPUs.
  • Links: MMEB Leaderboard, VLM2Vec-LLaVa-Next, vLLM Integration.
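
A minimal sketch for fetching the MMEB data with huggingface_hub; the dataset repo IDs and local paths below are assumptions, so check the README for the exact names and expected directory layout:

```python
# Minimal sketch: download the MMEB training and evaluation data from
# Hugging Face. The repo IDs and target directories are assumptions; check the
# README for the exact dataset names and expected layout.
from huggingface_hub import snapshot_download

train_dir = snapshot_download(
    repo_id="TIGER-Lab/MMEB-train",   # assumed dataset repo id
    repo_type="dataset",
    local_dir="data/MMEB-train",
)
eval_dir = snapshot_download(
    repo_id="TIGER-Lab/MMEB-eval",    # assumed dataset repo id
    repo_type="dataset",
    local_dir="data/MMEB-eval",
)
print("train:", train_dir)
print("eval:", eval_dir)
```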

Highlighted Details

  • Achieves new state-of-the-art performance on the MMEB benchmark with Qwen2VL 7B models.
  • Compatible with various VLM backbones including Phi-3.5-vision-instruct, LLaVa-Next, and Qwen2VL.
  • Training data includes "original" and "diverse_instruction" splits for reproducibility and enhanced robustness.
  • Integrated into vLLM for efficient offline inference (see the usage sketch after this list).
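
A minimal sketch of offline embedding through vLLM, assuming a VLM2Vec checkpoint that vLLM supports; the model id, prompt template, and task name are assumptions that vary with the vLLM version, so follow the vLLM Integration link for the supported recipe:

```python
# Minimal sketch of offline embedding with vLLM. Model id, prompt template,
# and task name are assumptions; consult the vLLM Integration docs.
from vllm import LLM
from PIL import Image

llm = LLM(model="TIGER-Lab/VLM2Vec-Full",  # assumed checkpoint id
          task="embed",                    # "embedding" in older vLLM releases
          trust_remote_code=True)

image = Image.open("example.jpg")
outputs = llm.embed({
    "prompt": "<|image_1|> Represent the given image for retrieval.",  # assumed instruction format
    "multi_modal_data": {"image": image},
})
embedding = outputs[0].outputs.embedding   # a plain list of floats
print(len(embedding))
```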

Maintenance & Community

  • Active development with recent releases of new models and features.
  • Changelog available for tracking changes.
  • Community engagement encouraged via issues.

Licensing & Compatibility

  • The repository itself does not explicitly state a license. Code is adapted from Tevatron, which is Apache 2.0 licensed. Model weights are typically released under the licenses of their respective base models (e.g., LLaVa, Qwen).

Limitations & Caveats

  • The specific license for the VLM2Vec code and datasets is not clearly stated in the README, which may impact commercial use.
  • Training requires significant computational resources and a large dataset.

Health Check

  • Last commit: 2 days ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 23
  • Star history: 146 stars in the last 90 days
