Research paper for multimodal embeddings using vision-language models
VLM2Vec provides a framework for training unified vision-language models (VLMs) capable of generating high-quality multimodal embeddings for diverse tasks. It targets researchers and practitioners seeking a single, robust model for various downstream applications, offering state-of-the-art performance on benchmarks like MMEB.
How It Works
VLM2Vec converts existing well-trained VLMs into embedding models by taking the hidden state of the last token in the sequence as the multimodal representation. This leverages the strength of pre-trained VLMs and is compatible with any open-source VLM backbone. By training on MMEB, a diverse dataset spanning multiple modalities, tasks, and instructions, VLM2Vec achieves robust, generalizable universal embeddings.
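Below is a minimal sketch of this last-token pooling idea, assuming a generic LLaVA-style backbone from Hugging Face transformers; the model id, prompt template, and pooling details are illustrative and may differ from the exact VLM2Vec implementation.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

# Illustrative backbone; VLM2Vec itself supports multiple open-source VLMs.
model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id)
model.eval()

image = Image.open("example.jpg")
prompt = "USER: <image>\nRepresent the given image for retrieval. ASSISTANT:"
inputs = processor(images=image, text=prompt, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# Use the hidden state of the final token as the multimodal embedding
# (assumes a single, unpadded sequence; with batching, index the last
# non-padded position per example instead).
last_hidden = outputs.hidden_states[-1]   # (batch, seq_len, hidden_dim)
embedding = last_hidden[:, -1, :]         # last-token pooling
embedding = torch.nn.functional.normalize(embedding, dim=-1)
```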
Quick Start & Requirements
Clone the repository with git clone and install the Python dependencies: transformers, datasets, accelerate, peft, bitsandbytes, gradio, einops, sentencepiece, xformers, and flash-attn. Training is launched with torchrun, and LoRA tuning is supported (see the sketch below).
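The following sketch shows how LoRA adapters can be attached to a VLM backbone with peft, since the repository supports LoRA tuning; the checkpoint name and target modules are placeholders, not the repository's actual configuration.

```python
from transformers import LlavaForConditionalGeneration
from peft import LoraConfig, get_peft_model

# Illustrative backbone; swap in whichever VLM you fine-tune.
base = LlavaForConditionalGeneration.from_pretrained("llava-hf/llava-1.5-7b-hf")

# Attach LoRA adapters to the attention projections (illustrative choice).
lora_cfg = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # only the LoRA adapter weights are trainable
```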
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats