Multimodal embeddings via LLM adaptation
E5-V is a framework for adapting Multimodal Large Language Models (MLLMs) to produce universal multimodal embeddings. It targets researchers and practitioners who need to bridge the modality gap for tasks such as cross-modal retrieval, delivering strong performance even without fine-tuning; notably, its single-modality (text-only) training recipe outperforms conventional multimodal training.
How It Works
E5-V builds on an existing MLLM, specifically LLaVA-NeXT, to process and embed both text and image inputs, taking the embeddings from the model's final hidden states. A key finding is that training exclusively on text pairs can outperform traditional multimodal training on multimodal embedding tasks, effectively closing the modality gap.
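As a rough, untested sketch of this mechanism (assuming a LLaVA-NeXT checkpoint loaded through Hugging Face Transformers; the helper name and the exact prompt wording below are illustrative assumptions, not the repository's verbatim code), both modalities are reduced to a single vector by taking the last token's final hidden state under a shared prompt:

```python
import torch
import torch.nn.functional as F

# Illustrative Llama-3-style chat template; the exact prompt wording used by
# E5-V is an assumption here, not copied from the repository.
TEMPLATE = ('<|start_header_id|>user<|end_header_id|>\n\n{}<|eot_id|>'
            '<|start_header_id|>assistant<|end_header_id|>\n\n \n')

def embed(model, processor, text=None, image=None):
    """Hypothetical helper: map a text or an image to one normalized vector."""
    if image is not None:
        prompt = TEMPLATE.format('<image>\nSummary above image in one word: ')
        inputs = processor(text=prompt, images=image, return_tensors='pt')
    else:
        prompt = TEMPLATE.format(f'{text}\nSummary above sentence in one word: ')
        inputs = processor(text=prompt, return_tensors='pt')
    inputs = inputs.to(model.device)
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True, return_dict=True)
    # The embedding is the final hidden state of the last token,
    # L2-normalized so that similarity is a simple dot product.
    return F.normalize(out.hidden_states[-1][:, -1, :], dim=-1)
```

Because text and images share the same prompt pattern and the same embedding location, their vectors land in a comparable space and can be scored with cosine similarity.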
Quick Start & Requirements
pip install -r requirements.txt
Load the royokong/e5-v model and processor from Hugging Face Transformers.
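A minimal loading sketch (the class names are the standard LLaVA-NeXT wrappers in Transformers; half precision on a single GPU is an assumption, adjust to your hardware):

```python
import torch
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration

# Assumed setup: fp16 weights on one CUDA device.
processor = LlavaNextProcessor.from_pretrained('royokong/e5-v')
model = LlavaNextForConditionalGeneration.from_pretrained(
    'royokong/e5-v', torch_dtype=torch.float16
).to('cuda')
model.eval()  # inference only; embeddings come from a forward pass
```

With the model and processor in hand, text and image embeddings can be extracted as in the sketch above and compared via a dot product of the normalized vectors.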
Highlighted Details
Maintenance & Community
The last recorded commit was about 7 months ago, and the project is currently marked inactive.
Licensing & Compatibility
Limitations & Caveats