LLM2CLIP by microsoft

Multimodal learning research leveraging LLMs for enhanced CLIP visual encoding

Created 1 year ago
546 stars

Top 58.6% on SourcePulse

View on GitHub
1 Expert Loves This Project
Project Summary

LLM2CLIP enhances CLIP models by leveraging large language models (LLMs) as textual teachers, enabling richer visual representations and improved text-image alignment. This project is targeted at researchers and developers working with multimodal AI, offering a method to significantly boost CLIP's performance on tasks involving complex or long textual inputs.

How It Works

LLM2CLIP employs a Caption-to-Caption Contrastive Learning strategy. The LLM is first trained to distinguish captions of the same image from captions of different images, which improves the separability of its output space. The fine-tuned LLM is then frozen (its gradients are no longer updated) and used as the text encoder while CLIP's visual encoder is fine-tuned against it. This lets CLIP benefit from the LLM's longer input window, stronger understanding of dense captions, and open-world knowledge, leading to more efficient and powerful multimodal feature alignment.
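
As a rough illustration of that objective, here is a minimal PyTorch sketch of a caption-to-caption contrastive loss, assuming each image in a batch contributes two captions and a text encoder has already produced one embedding per caption; the function name and temperature are illustrative choices, not the repository's API.

```python
import torch
import torch.nn.functional as F

def caption_contrastive_loss(anchor_emb, positive_emb, temperature=0.05):
    """anchor_emb, positive_emb: (batch, dim) embeddings of two captions that
    describe the same image, aligned row by row. Illustrative only."""
    anchor = F.normalize(anchor_emb, dim=-1)
    positive = F.normalize(positive_emb, dim=-1)
    # Cosine similarity of every anchor caption against every positive caption.
    logits = anchor @ positive.t() / temperature
    # The caption of the same image sits on the diagonal; every other caption
    # in the batch serves as a negative.
    targets = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)
```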

Quick Start & Requirements

  • Installation: conda create -n llm2clip python=3.8; conda activate llm2clip; pip install -r requirements.txt.
  • Data Preparation: Requires downloading datasets like cc3m, cc12m, or yfcc15m and extracting embeddings (see the sketch after this list).
  • Training: Initiated via sh run.sh.
  • Dependencies: Python 3.8, PyTorch, Hugging Face Transformers, datasets like CC3M/CC12M/YFCC15M.
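
Below is a hedged sketch of what the embedding-extraction step can look like, assuming the embeddings in question are caption (text) embeddings produced by a frozen encoder and cached to disk; the model name, mean pooling, and output file are placeholders, and the repository's own scripts define the actual encoder and format.

```python
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"  # placeholder text encoder
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
encoder = AutoModel.from_pretrained(MODEL_NAME).eval()

@torch.no_grad()
def embed_captions(captions, batch_size=64):
    """Mean-pool the encoder's last hidden states into one vector per caption."""
    chunks = []
    for i in range(0, len(captions), batch_size):
        batch = tokenizer(captions[i:i + batch_size], padding=True,
                          truncation=True, return_tensors="pt")
        hidden = encoder(**batch).last_hidden_state          # (B, T, D)
        mask = batch["attention_mask"].unsqueeze(-1)          # (B, T, 1)
        chunks.append((hidden * mask).sum(1) / mask.sum(1))   # mean pooling
    return torch.cat(chunks)

# embeddings = embed_captions(["a dog on a beach", "two people riding bikes"])
# torch.save(embeddings, "caption_embeddings.pt")
```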

Highlighted Details

  • Outperforms standard Chinese CLIP models despite being fine-tuned purely on English.
  • Significantly improves performance on long-text and short-text retrieval tasks.
  • Explores applications in specialized domains like medicine and law, leveraging LLM knowledge for data augmentation.
  • Future plans include scaled-up versions (10-100x larger) and support for video modalities.

Maintenance & Community

The project is associated with the NeurIPS 2024 Workshop on Self-Supervised Learning: Theory and Practice. Updates on new models and datasets are announced on Hugging Face. The code is built upon EVA-CLIP.

Licensing & Compatibility

The repository does not explicitly state a license in the README. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The project is presented as research under active development, with larger models and additional modalities still planned. Specific details on how the loss functions (SimCSE, MNTP) are integrated and weighted, and on dataset mixing strategies, are clarified in the FAQ.
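
As a purely illustrative aid for the kind of questions the FAQ covers, the sketch below shows one common pattern: a weighted sum of objectives and probabilistic dataset mixing. The weights, ratios, and helper names are assumptions, not values from the repository.

```python
import random

def total_training_loss(simcse_loss, mntp_loss, w_simcse=1.0, w_mntp=0.2):
    # Weighted sum of a caption-contrastive (SimCSE-style) objective and a
    # masked next-token prediction (MNTP) objective; weights are made up.
    return w_simcse * simcse_loss + w_mntp * mntp_loss

def pick_source(datasets, mixing_weights):
    # Dataset mixing: draw each batch's source with fixed probabilities,
    # e.g. pick_source(["cc3m", "cc12m"], [0.3, 0.7]).
    return random.choices(datasets, weights=mixing_weights, k=1)[0]
```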

Health Check

  • Last commit: 2 months ago
  • Responsiveness: 1 week
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 10 stars in the last 30 days

Starred by Jason Knight (Director AI Compilers at NVIDIA; Cofounder of OctoML), Travis Fischer (Founder of Agentic), and 5 more.

Explore Similar Projects

fromage by kohjingyu

0%
482 stars
Multimodal model for grounding language models to images
Created 2 years ago
Updated 1 year ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems") and Omar Sanseviero (DevRel at Google DeepMind).

gill by kohjingyu

0%
463 stars
Multimodal LLM for generating/retrieving images and generating text
Created 2 years ago
Updated 1 year ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Yaowei Zheng (Author of LLaMA-Factory), and 1 more.

CLIP_prefix_caption by rmokady

0.1%
1k stars
Image captioning model using CLIP embeddings as a prefix
Created 4 years ago
Updated 1 year ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems") and Elvis Saravia (Founder of DAIR.AI).

NExT-GPT by NExT-GPT

0.1%
4k stars
Any-to-any multimodal LLM research paper
Created 2 years ago
Updated 4 months ago