kongds/E5-V: Multimodal embeddings via LLM adaptation
E5-V provides a framework for adapting Multimodal Large Language Models (MLLMs) to produce universal multimodal embeddings. It targets researchers and practitioners who need to bridge the modality gap for tasks such as cross-modal retrieval, demonstrating strong performance even without fine-tuning, and its single-modality (text-only) training recipe outperforms conventional multimodal training.
How It Works
E5-V leverages the architecture of MLLMs, specifically LLaVA-Next, to process and embed both text and image inputs. It extracts embeddings from the final hidden states of the model. A key innovation is the finding that training exclusively on text pairs can yield superior performance for multimodal embedding tasks compared to traditional multimodal training, effectively addressing the modality gap.
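As a rough illustration of that embedding step (a minimal sketch, not the repository's exact code), the representation is read off the last token of the final hidden layer and L2-normalized, assuming the forward pass was run with output_hidden_states=True:

    import torch
    import torch.nn.functional as F

    def last_token_embedding(outputs) -> torch.Tensor:
        # `outputs` comes from a forward pass with output_hidden_states=True.
        # hidden_states[-1] is the final layer: shape (batch, seq_len, hidden_dim).
        last_hidden = outputs.hidden_states[-1]
        # Use the final token's state as the sequence embedding and L2-normalize it,
        # so cosine similarity between embeddings reduces to a dot product.
        return F.normalize(last_hidden[:, -1, :], dim=-1)

Because text and image inputs are embedded through the same extraction, embeddings learned from text pairs alone remain directly comparable across modalities.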
Quick Start & Requirements
Install dependencies with pip install -r requirements.txt, then load the royokong/e5-v model and processor from Hugging Face Transformers, as in the sketch below.
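A minimal end-to-end sketch (assuming the royokong/e5-v checkpoint loads with the standard LLaVA-Next classes and the one-word-summary prompt wrapping described in the repository; verify the exact template strings against the README):

    import torch
    import torch.nn.functional as F
    import requests
    from PIL import Image
    from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration

    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    dtype = torch.float16 if device == 'cuda' else torch.float32

    # Llama-3-style chat template around the one-word-summary prompts (assumed wording).
    template = ('<|start_header_id|>user<|end_header_id|>\n\n{}<|eot_id|>'
                '<|start_header_id|>assistant<|end_header_id|>\n\n \n')
    text_prompt = template.format('{}\nSummary above sentence in one word: ')
    image_prompt = template.format('<image>\nSummary above image in one word: ')

    processor = LlavaNextProcessor.from_pretrained('royokong/e5-v')
    model = LlavaNextForConditionalGeneration.from_pretrained(
        'royokong/e5-v', torch_dtype=dtype).to(device)

    texts = ['Two cats lying on a couch.']
    images = [Image.open(requests.get(
        'http://images.cocodataset.org/val2017/000000039769.jpg', stream=True).raw)]

    text_inputs = processor(text=[text_prompt.format(t) for t in texts],
                            return_tensors='pt', padding=True).to(device)
    image_inputs = processor(text=[image_prompt] * len(images), images=images,
                             return_tensors='pt', padding=True).to(device)

    with torch.no_grad():
        # Embeddings are the final hidden state of the last token (see How It Works).
        # With batched inputs of unequal length, take the last non-padding token instead.
        text_emb = model(**text_inputs, output_hidden_states=True,
                         return_dict=True).hidden_states[-1][:, -1, :]
        image_emb = model(**image_inputs, output_hidden_states=True,
                          return_dict=True).hidden_states[-1][:, -1, :]

    text_emb = F.normalize(text_emb.float(), dim=-1)
    image_emb = F.normalize(image_emb.float(), dim=-1)
    print(text_emb @ image_emb.T)  # cosine similarity for cross-modal retrieval

Prompting both modalities to compress their meaning into a single word is what keeps text and image embeddings in a shared space, which is why the text-only training described above transfers to multimodal inputs.

Highlighted Details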
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats