E5-V  by kongds

Multimodal embeddings via LLM adaptation

created 1 year ago
261 stars

Top 98.0% on sourcepulse

Project Summary

E5-V provides a framework for adapting Multimodal Large Language Models (MLLMs) to generate universal multimodal embeddings. It targets researchers and practitioners who need to bridge the modality gap for tasks such as cross-modal retrieval, demonstrating strong performance even without fine-tuning and showing that single-modality (text-only) training can outperform multimodal training.

How It Works

E5-V leverages the architecture of MLLMs, specifically LLaVA-Next, to process and embed both text and image inputs. It extracts embeddings from the final hidden states of the model. A key innovation is the finding that training exclusively on text pairs can yield superior performance for multimodal embedding tasks compared to traditional multimodal training, effectively addressing the modality gap.
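Once text and image embeddings live in the same space, cross-modal retrieval reduces to cosine similarity between L2-normalized vectors. The sketch below illustrates that retrieval step with random stand-in embeddings (the shapes and data are purely hypothetical; in E5-V the vectors would come from the model's final hidden states):

```python
import numpy as np

# Hypothetical embeddings standing in for E5-V outputs: one row per
# caption / image. Real E5-V embeddings are the final hidden state of
# the last token; 8 dims here is just for illustration.
rng = np.random.default_rng(0)
text_embs = rng.normal(size=(3, 8))   # 3 captions
img_embs = rng.normal(size=(3, 8))    # 3 images

# L2-normalize so dot products become cosine similarities.
text_embs /= np.linalg.norm(text_embs, axis=1, keepdims=True)
img_embs /= np.linalg.norm(img_embs, axis=1, keepdims=True)

# Text-to-image retrieval: score every image against every caption,
# then pick the best-matching image for each caption.
sims = text_embs @ img_embs.T         # similarity matrix, shape (3, 3)
best = sims.argmax(axis=1)            # best image index per caption
```

Because both modalities pass through the same LLM, no learned projection head is needed between them; normalization alone makes the scores comparable.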

Quick Start & Requirements

  • Install via pip install -r requirements.txt.
  • Requires Python, PyTorch, Hugging Face Transformers, and Accelerate.
  • GPU with CUDA is recommended for performance.
  • Example usage involves loading the royokong/e5-v model and processor from Hugging Face Transformers.
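The loading pattern described above can be sketched as follows. The prompt template and the "summary in one word" prompts follow the convention used in the repository's example, but treat the exact template string, the `device_map` setting, and the `embed_texts` helper as assumptions, not the project's canonical API:

```python
# Llama-3-style chat template E5-V wraps around both text and image
# prompts (modeled on the repo's example; exact string is an assumption).
llama3_template = (
    '<|start_header_id|>user<|end_header_id|>\n\n'
    '{}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n \n'
)

img_prompt = llama3_template.format('<image>\nSummary above image in one word: ')
text_prompt = llama3_template.format('<sent>\nSummary above sentence in one word: ')

def embed_texts(texts):
    """Hypothetical helper: embed sentences with royokong/e5-v.

    Imports live inside the function because loading the model pulls
    ~8B weights; a CUDA GPU is strongly recommended.
    """
    import torch
    from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration

    processor = LlavaNextProcessor.from_pretrained('royokong/e5-v')
    model = LlavaNextForConditionalGeneration.from_pretrained(
        'royokong/e5-v', torch_dtype=torch.float16, device_map='auto')

    prompts = [text_prompt.replace('<sent>', t) for t in texts]
    inputs = processor(prompts, return_tensors='pt', padding=True).to(model.device)
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True, return_dict=True)
    # Embedding = final hidden state of the last token, L2-normalized.
    emb = out.hidden_states[-1][:, -1, :]
    return torch.nn.functional.normalize(emb, dim=-1)
```

Image embedding follows the same shape, substituting `img_prompt` and passing the images to the processor alongside the prompts.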

Highlighted Details

  • Achieves strong performance on multimodal embedding tasks.
  • Single-modality training on text pairs shows superior results compared to multimodal training.
  • Supports evaluation on COCO, Flickr30k, FashionIQ, CIRR, and STS benchmarks.
  • Codebase is based on SimCSE and Alpaca-LoRA.
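Since the codebase builds on SimCSE, the text-pair training objective is a contrastive (InfoNCE) loss with in-batch negatives. The sketch below illustrates that idea with NumPy on random vectors; it is not the repository's training code, and the temperature value is an assumption:

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.05):
    """SimCSE-style InfoNCE loss over a batch of (anchor, positive) pairs.

    Each anchor should match its own positive; the other positives in the
    batch act as in-batch negatives. Both arrays have shape (batch, dim).
    """
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = (a @ p.T) / temperature             # cosine similarities, scaled
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))          # NLL of the matching pairs

rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 16))
# Near-identical positives -> low loss; unrelated positives -> high loss.
loss_matched = info_nce(emb, emb + 0.01 * rng.normal(size=(4, 16)))
loss_random = info_nce(emb, rng.normal(size=(4, 16)))
```

The key point of the paper is that applying exactly this kind of loss to text pairs alone, with the MLLM producing the embeddings, transfers to image and interleaved inputs at inference time.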

Maintenance & Community

  • The project is maintained by kongds.
  • No specific community links (Discord, Slack) or roadmap are provided in the README.

Licensing & Compatibility

  • The README does not explicitly state a license. The underlying models used (LLaVA-Next, Llama 3) have their own licenses which may impose restrictions.

Limitations & Caveats

  • The project's licensing is not specified, which could impact commercial use.
  • Training instructions require downloading specific NLI datasets and converting Llama 3 models to Hugging Face format, which can be resource-intensive.
Health Check

  • Last commit: 7 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 16 stars in the last 90 days
