Multimodal learning research leveraging LLMs for enhanced CLIP visual encoding
LLM2CLIP enhances CLIP models by leveraging large language models (LLMs) as textual teachers, enabling richer visual representations and improved text-image alignment. This project is targeted at researchers and developers working with multimodal AI, offering a method to significantly boost CLIP's performance on tasks involving complex or long textual inputs.
How It Works
LLM2CLIP uses a caption-to-caption contrastive learning strategy. First, the LLM is fine-tuned to pull together captions of the same image and push apart captions of different images, which improves the separability of its output embedding space. The fine-tuned LLM is then frozen (its gradients are not updated) and used as the textual teacher while CLIP's visual encoder is fine-tuned against its embeddings. This lets CLIP benefit from the LLM's longer input window, stronger understanding of dense captions, and open-world knowledge, yielding more efficient and powerful multimodal feature alignment.
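The caption-to-caption objective can be illustrated with a minimal SimCSE-style sketch. This is not the project's actual implementation; the function name, batch layout, and temperature are illustrative assumptions. Each image contributes two captions, which form a positive pair, while the other captions in the batch act as negatives.

```python
# Minimal sketch of a caption-to-caption contrastive loss (illustrative only;
# names and hyperparameters are assumptions, not LLM2CLIP's actual code).
import torch
import torch.nn.functional as F

def caption_contrastive_loss(emb_a, emb_b, temperature=0.05):
    """emb_a[i] and emb_b[i] are LLM embeddings of two captions of image i.

    Captions of the same image are pulled together; captions of all other
    images in the batch serve as in-batch negatives (InfoNCE / SimCSE-style).
    """
    emb_a = F.normalize(emb_a, dim=-1)
    emb_b = F.normalize(emb_b, dim=-1)
    logits = emb_a @ emb_b.t() / temperature               # (B, B) similarity matrix
    targets = torch.arange(emb_a.size(0), device=emb_a.device)
    # Symmetric cross-entropy: each caption must retrieve its paired caption.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```

Training on this loss sharpens the LLM's embedding space before it is frozen as the teacher for CLIP's visual encoder.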
Quick Start & Requirements
conda create -n llm2clip python=3.8
conda activate llm2clip
pip install -r requirements.txt

Data preparation involves downloading a caption dataset (cc3m, cc12m, or yfcc15m) and extracting the LLM caption embeddings; training is then launched with sh run.sh.
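As a rough picture of the embedding-extraction step, the sketch below mean-pools hidden states from a Hugging Face text model into per-caption embeddings. The checkpoint, pooling choice, and output path are assumptions for illustration; the repository's own scripts should be used for real training.

```python
# Illustrative sketch of precomputing caption embeddings before CLIP fine-tuning.
# The checkpoint below is a small stand-in; the real pipeline would use the
# project's caption-contrastive-tuned LLM.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "sentence-transformers/all-MiniLM-L6-v2"  # placeholder text encoder
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name).eval()

captions = [
    "a dog playing in the snow",
    "a crowded street market at night",
]
batch = tokenizer(captions, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    hidden = model(**batch).last_hidden_state                   # (B, T, D) token states
    mask = batch["attention_mask"].unsqueeze(-1).float()
    embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)   # mean pooling

torch.save(embeddings, "caption_embeddings.pt")  # consumed later by the training run
```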
Highlighted Details
Maintenance & Community
The project is associated with NeurIPS 2024 Workshop: Self-Supervised Learning - Theory and Practice. Updates on new models and datasets are announced on HuggingFace. The code is built upon EVA-CLIP.
Licensing & Compatibility
The repository does not explicitly state a license in the README. Compatibility for commercial use or closed-source linking is not specified.
Limitations & Caveats
The project is presented as ongoing research, with plans for larger models and additional modalities. Details on how the loss functions (SimCSE, MNTP) are integrated and weighted, and on dataset mixing strategies, are addressed only in the FAQ, reflecting the research-oriented nature of the codebase.