E5-V by kongds

Multimodal embeddings via LLM adaptation

Created 1 year ago
270 stars

Top 95.2% on SourcePulse

View on GitHub
Project Summary

E5-V provides a framework for adapting Multimodal Large Language Models (MLLMs) to produce universal multimodal embeddings. It targets researchers and practitioners who need to bridge the modality gap for tasks such as cross-modal retrieval; it demonstrates strong performance even without fine-tuning, and its single-modality (text-only) training approach outperforms conventional multimodal training.

How It Works

E5-V leverages the architecture of MLLMs, specifically LLaVA-Next, to process and embed both text and image inputs. It extracts embeddings from the final hidden states of the model. A key innovation is the finding that training exclusively on text pairs can yield superior performance for multimodal embedding tasks compared to traditional multimodal training, effectively addressing the modality gap.
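
The text-only training described above amounts to a standard in-batch contrastive objective over sentence pairs, applied to embeddings taken from the model's last hidden state. The sketch below is a minimal illustration of that kind of InfoNCE loss, not code from the E5-V repository; the function name, temperature, and batch handling are assumptions.

    import torch
    import torch.nn.functional as F

    def text_pair_contrastive_loss(anchor_emb: torch.Tensor,
                                   positive_emb: torch.Tensor,
                                   temperature: float = 0.05) -> torch.Tensor:
        """In-batch InfoNCE loss over embeddings of paired sentences.

        anchor_emb, positive_emb: (batch, dim) tensors, e.g. the last-token
        hidden states the MLLM produces for the two sentences of each pair.
        """
        anchor = F.normalize(anchor_emb, dim=-1)
        positive = F.normalize(positive_emb, dim=-1)
        # Cosine similarity of every anchor against every positive in the batch.
        logits = anchor @ positive.T / temperature
        # The i-th anchor matches the i-th positive; all other pairs act as negatives.
        labels = torch.arange(logits.size(0), device=logits.device)
        return F.cross_entropy(logits, labels)

    # Toy example with random vectors standing in for model outputs.
    loss = text_pair_contrastive_loss(torch.randn(8, 4096), torch.randn(8, 4096))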

Quick Start & Requirements

  • Install via pip install -r requirements.txt.
  • Requires Python, PyTorch, Hugging Face Transformers, and Accelerate.
  • GPU with CUDA is recommended for performance.
  • Example usage involves loading the royokong/e5-v model and processor from Hugging Face Transformers (a sketch follows this list).
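
As a concrete starting point, here is a minimal sketch of the usage described above. It assumes royokong/e5-v exposes the standard LLaVA-Next interface in Hugging Face Transformers (LlavaNextProcessor / LlavaNextForConditionalGeneration) and a CUDA GPU; the prompt wording and the last-token pooling shown here are illustrative assumptions, so consult the upstream README for the exact template E5-V expects.

    import torch
    import torch.nn.functional as F
    from PIL import Image
    from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration

    # Assumption: royokong/e5-v follows the standard LLaVA-Next interface.
    processor = LlavaNextProcessor.from_pretrained("royokong/e5-v")
    model = LlavaNextForConditionalGeneration.from_pretrained(
        "royokong/e5-v", torch_dtype=torch.float16, device_map="auto"
    )

    # Illustrative prompt; E5-V derives the embedding from the final hidden
    # state of the last prompt token. The exact template is an assumption here.
    prompt = "[INST] <image>\nSummarize the image in one word: [/INST]"
    image = Image.open("example.jpg")  # hypothetical local image

    inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True, return_dict=True)

    # Embedding = last token of the final hidden layer, L2-normalized.
    embedding = F.normalize(outputs.hidden_states[-1][:, -1, :], dim=-1)
    print(embedding.shape)  # (1, hidden_size)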

Highlighted Details

  • Achieves strong performance on multimodal embedding tasks.
  • Single-modality (text-only) training outperforms conventional multimodal training.
  • Supports evaluation on COCO, Flickr30k, FashionIQ, CIRR, and STS benchmarks.
  • Codebase is based on SimCSE and Alpaca-LoRA.

Maintenance & Community

  • The project is maintained by kongds.
  • No specific community links (Discord, Slack) or roadmap are provided in the README.

Licensing & Compatibility

  • The README does not explicitly state a license. The underlying models (LLaVA-Next, Llama 3) are distributed under their own licenses, which may impose restrictions.

Limitations & Caveats

  • The project's licensing is not specified, which could impact commercial use.
  • Training instructions require downloading specific NLI datasets and converting Llama 3 models to Hugging Face format, which can be resource-intensive.

Health Check

  • Last Commit: 8 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 8 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems").

X-LLM by phellonchen

  • Multimodal LLM research paper
  • 0% · 314 stars
  • Created 2 years ago · Updated 2 years ago

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems") and Wing Lian (Founder of Axolotl AI).

AnyGPT by OpenMOSS

  • Multimodal LLM research paper for any-to-any modality conversion
  • 0.2% · 863 stars
  • Created 1 year ago · Updated 1 year ago

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Wing Lian (Founder of Axolotl AI), and 10 more.

open_flamingo by mlfoundations

  • Open-source framework for training large multimodal models
  • 0.1% · 4k stars
  • Created 2 years ago · Updated 1 year ago

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems") and Elvis Saravia (Founder of DAIR.AI).

NExT-GPT by NExT-GPT

  • Any-to-any multimodal LLM research paper
  • 0.1% · 4k stars
  • Created 2 years ago · Updated 4 months ago