LLM2CLIP by microsoft

Multimodal learning research leveraging LLMs for enhanced CLIP visual encoding

created 1 year ago
531 stars

Top 60.4% on sourcepulse

View on GitHub
Project Summary

LLM2CLIP enhances CLIP models by leveraging large language models (LLMs) as textual teachers, enabling richer visual representations and improved text-image alignment. This project is targeted at researchers and developers working with multimodal AI, offering a method to significantly boost CLIP's performance on tasks involving complex or long textual inputs.

How It Works

LLM2CLIP employs a caption-to-caption contrastive learning strategy. The LLM is first fine-tuned to pull captions of the same image together and push captions of different images apart, improving the separability of its output embedding space. The LLM is then frozen, its weights excluded from gradient updates, and used as the textual teacher while CLIP's visual encoder is fine-tuned against its embeddings. This lets CLIP benefit from the LLM's extended input window, stronger understanding of dense captions, and open-world knowledge, yielding more efficient and powerful multimodal feature alignment.
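Illustratively, the caption-contrastive stage can be written as a symmetric InfoNCE loss over paired captions. The following is a minimal PyTorch sketch under assumed conventions (in-batch negatives, a temperature of 0.07); the function and variable names are illustrative, not the repository's actual API.

```python
import torch
import torch.nn.functional as F

def caption_contrastive_loss(anchor_emb, positive_emb, temperature=0.07):
    """Symmetric InfoNCE over two caption views.

    anchor_emb, positive_emb: [batch, dim] embeddings of two different
    captions describing the same image; all other rows in the batch act
    as negatives (captions of other images).
    """
    a = F.normalize(anchor_emb, dim=-1)
    p = F.normalize(positive_emb, dim=-1)
    logits = a @ p.t() / temperature                    # [batch, batch]
    targets = torch.arange(a.size(0), device=a.device)
    # Matching caption pairs sit on the diagonal; cross-entropy pulls
    # them together and pushes apart captions of different images.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

# After this stage the LLM is frozen and serves as CLIP's text teacher:
# for p in llm.parameters():
#     p.requires_grad_(False)
```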

Quick Start & Requirements

  • Installation: conda create -n llm2clip python=3.8, conda activate llm2clip, pip install -r requirements.txt.
  • Data Preparation: Requires downloading datasets like cc3m, cc12m, or yfcc15m and extracting caption embeddings (a sketch of this step follows the list).
  • Training: Initiated via sh run.sh.
  • Dependencies: Python 3.8, PyTorch, Hugging Face Transformers, datasets like CC3M/CC12M/YFCC15M.
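The data-preparation step precomputes caption embeddings with the frozen LLM. Below is a hedged sketch of that step using Hugging Face Transformers; the checkpoint name and the mean-pooling choice are assumptions for illustration and may differ from the repository's actual scripts.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Assumed checkpoint for illustration; the repo's LLM may differ.
MODEL_NAME = "meta-llama/Meta-Llama-3-8B"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
if tokenizer.pad_token is None:        # Llama tokenizers ship without one
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModel.from_pretrained(MODEL_NAME, torch_dtype=torch.float16)
model.eval()  # frozen: the LLM is used purely as a feature extractor

@torch.no_grad()
def embed_captions(captions, max_length=256):
    """Mean-pool the last hidden state over non-padding tokens."""
    batch = tokenizer(captions, padding=True, truncation=True,
                      max_length=max_length, return_tensors="pt")
    hidden = model(**batch).last_hidden_state       # [B, T, D]
    mask = batch["attention_mask"].unsqueeze(-1)    # [B, T, 1]
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

embeddings = embed_captions(["a dog chasing a ball",
                             "a red bicycle leaning on a wall"])
```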

Highlighted Details

  • Outperforms native Chinese CLIP models on Chinese-language retrieval despite being fine-tuned purely on English data.
  • Significantly improves performance on long-text and short-text retrieval tasks.
  • Explores applications in specialized domains like medicine and law, leveraging LLM knowledge for data augmentation.
  • Future plans include scaled-up versions (10-100x larger) and support for video modalities.

Maintenance & Community

The project is associated with the NeurIPS 2024 workshop Self-Supervised Learning: Theory and Practice. Updates on new models and datasets are announced on Hugging Face. The code is built upon EVA-CLIP.

Licensing & Compatibility

The repository does not explicitly state a license in the README. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

LLM2CLIP is research code under active development; larger models and new modalities are planned but not yet released. Specifics of how the SimCSE and MNTP losses are combined and weighted, and how datasets are mixed during training, are documented only in the project FAQ.

Health Check

  • Last commit: 1 month ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 4
  • Star History: 28 stars in the last 90 days

Explore Similar Projects

Starred by Stas Bekman (Author of Machine Learning Engineering Open Book; Research Engineer at Snowflake), Douwe Kiela (Cofounder of Contextual AI), and 1 more.

lens by ContextualAI

  • 352 stars
  • Vision-language research paper using LLMs
  • created 2 years ago, updated 1 week ago