Multimodal embeddings via LLM adaptation
E5-V is a framework for adapting Multimodal Large Language Models (MLLMs) to produce universal multimodal embeddings. It targets researchers and practitioners who need to bridge the modality gap for tasks such as cross-modal retrieval, delivering strong performance even without fine-tuning; notably, its single-modality (text-only) training recipe outperforms conventional multimodal training.
How It Works
E5-V builds on an existing MLLM, specifically LLaVA-NeXT, to process and embed both text and image inputs, taking the embeddings from the model's final hidden states. A key finding is that training exclusively on text pairs can outperform traditional multimodal training on multimodal embedding tasks, effectively closing the modality gap.
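As a rough, untested sketch of this mechanism (assuming a LLaVA-NeXT checkpoint loaded through Hugging Face Transformers; the helper name and the exact prompt wording below are illustrative assumptions, not the repository's verbatim code), both modalities are reduced to a single vector by taking the last token's final hidden state under a shared prompt:

```python
import torch
import torch.nn.functional as F

# Illustrative Llama-3-style chat template; the exact prompt wording used by
# E5-V is an assumption here, not copied from the repository.
TEMPLATE = ('<|start_header_id|>user<|end_header_id|>\n\n{}<|eot_id|>'
            '<|start_header_id|>assistant<|end_header_id|>\n\n \n')

def embed(model, processor, text=None, image=None):
    """Hypothetical helper: map a text or an image to one normalized vector."""
    if image is not None:
        prompt = TEMPLATE.format('<image>\nSummary above image in one word: ')
        inputs = processor(text=prompt, images=image, return_tensors='pt')
    else:
        prompt = TEMPLATE.format(f'{text}\nSummary above sentence in one word: ')
        inputs = processor(text=prompt, return_tensors='pt')
    inputs = inputs.to(model.device)
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True, return_dict=True)
    # The embedding is the final hidden state of the last token,
    # L2-normalized so that similarity is a simple dot product.
    return F.normalize(out.hidden_states[-1][:, -1, :], dim=-1)
```

Because text and images share the same prompt pattern and the same embedding location, their vectors land in a comparable space and can be scored with cosine similarity.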
Quick Start & Requirements
pip install -r requirements.txt
Load the royokong/e5-v model and processor from Hugging Face Transformers.
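A minimal loading sketch (the class names are the standard LLaVA-NeXT wrappers in Transformers; half precision on a single GPU is an assumption, adjust to your hardware):

```python
import torch
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration

# Assumed setup: fp16 weights on one CUDA device.
processor = LlavaNextProcessor.from_pretrained('royokong/e5-v')
model = LlavaNextForConditionalGeneration.from_pretrained(
    'royokong/e5-v', torch_dtype=torch.float16
).to('cuda')
model.eval()  # inference only; embeddings come from a forward pass
```

With the model and processor in hand, text and image embeddings can be extracted as in the sketch above and compared via a dot product of the normalized vectors.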
Highlighted Details
Maintenance & Community
The last recorded commit was about 7 months ago, and the project is currently marked inactive.
Licensing & Compatibility
Limitations & Caveats