LLM2CLIP by microsoft

Multimodal learning research leveraging LLMs for enhanced CLIP visual encoding

Created 1 year ago
546 stars

Top 58.6% on SourcePulse

View on GitHub
1 Expert Loves This Project
Project Summary

LLM2CLIP enhances CLIP models by leveraging large language models (LLMs) as textual teachers, enabling richer visual representations and improved text-image alignment. This project is targeted at researchers and developers working with multimodal AI, offering a method to significantly boost CLIP's performance on tasks involving complex or long textual inputs.

How It Works

LLM2CLIP employs a Caption-to-Caption Contrastive Learning strategy. The LLM is first trained to distinguish captions of the same image from captions of different images, which improves the separability of its output space. The fine-tuned LLM is then frozen (its gradients are no longer updated) and used as the text encoder while CLIP's visual encoder is fine-tuned against it. This lets CLIP benefit from the LLM's longer input window, stronger understanding of dense captions, and open-world knowledge, leading to more efficient and powerful multimodal feature alignment.
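
As a rough illustration of that objective, here is a minimal PyTorch sketch of a caption-to-caption contrastive loss, assuming each image in a batch contributes two captions and a text encoder has already produced one embedding per caption; the function name and temperature are illustrative choices, not the repository's API.

```python
import torch
import torch.nn.functional as F

def caption_contrastive_loss(anchor_emb, positive_emb, temperature=0.05):
    """anchor_emb, positive_emb: (batch, dim) embeddings of two captions that
    describe the same image, aligned row by row. Illustrative only."""
    anchor = F.normalize(anchor_emb, dim=-1)
    positive = F.normalize(positive_emb, dim=-1)
    # Cosine similarity of every anchor caption against every positive caption.
    logits = anchor @ positive.t() / temperature
    # The caption of the same image sits on the diagonal; every other caption
    # in the batch serves as a negative.
    targets = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)
```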

Quick Start & Requirements

  • Installation: conda create -n llm2clip python=3.8; conda activate llm2clip; pip install -r requirements.txt.
  • Data Preparation: Requires downloading datasets like cc3m, cc12m, or yfcc15m and extracting embeddings (see the sketch after this list).
  • Training: Initiated via sh run.sh.
  • Dependencies: Python 3.8, PyTorch, Hugging Face Transformers, datasets like CC3M/CC12M/YFCC15M.
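
Below is a hedged sketch of what the embedding-extraction step can look like, assuming the embeddings in question are caption (text) embeddings produced by a frozen encoder and cached to disk; the model name, mean pooling, and output file are placeholders, and the repository's own scripts define the actual encoder and format.

```python
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"  # placeholder text encoder
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
encoder = AutoModel.from_pretrained(MODEL_NAME).eval()

@torch.no_grad()
def embed_captions(captions, batch_size=64):
    """Mean-pool the encoder's last hidden states into one vector per caption."""
    chunks = []
    for i in range(0, len(captions), batch_size):
        batch = tokenizer(captions[i:i + batch_size], padding=True,
                          truncation=True, return_tensors="pt")
        hidden = encoder(**batch).last_hidden_state          # (B, T, D)
        mask = batch["attention_mask"].unsqueeze(-1)          # (B, T, 1)
        chunks.append((hidden * mask).sum(1) / mask.sum(1))   # mean pooling
    return torch.cat(chunks)

# embeddings = embed_captions(["a dog on a beach", "two people riding bikes"])
# torch.save(embeddings, "caption_embeddings.pt")
```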

Highlighted Details

  • Outperforms standard Chinese CLIP models despite being fine-tuned purely on English.
  • Significantly improves performance on long-text and short-text retrieval tasks.
  • Explores applications in specialized domains like medicine and law, leveraging LLM knowledge for data augmentation.
  • Future plans include scaled-up versions (10-100x larger) and support for video modalities.

Maintenance & Community

The project is associated with the NeurIPS 2024 Workshop on Self-Supervised Learning: Theory and Practice. Updates on new models and datasets are announced on Hugging Face. The code is built upon EVA-CLIP.

Licensing & Compatibility

The repository does not explicitly state a license in the README. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The project is presented as research under active development, with larger models and additional modalities still planned. Specific details on how the loss functions (SimCSE, MNTP) are integrated and weighted, and on dataset mixing strategies, are clarified in the FAQ.
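
As a purely illustrative aid for the kind of questions the FAQ covers, the sketch below shows one common pattern: a weighted sum of objectives and probabilistic dataset mixing. The weights, ratios, and helper names are assumptions, not values from the repository.

```python
import random

def total_training_loss(simcse_loss, mntp_loss, w_simcse=1.0, w_mntp=0.2):
    # Weighted sum of a caption-contrastive (SimCSE-style) objective and a
    # masked next-token prediction (MNTP) objective; weights are made up.
    return w_simcse * simcse_loss + w_mntp * mntp_loss

def pick_source(datasets, mixing_weights):
    # Dataset mixing: draw each batch's source with fixed probabilities,
    # e.g. pick_source(["cc3m", "cc12m"], [0.3, 0.7]).
    return random.choices(datasets, weights=mixing_weights, k=1)[0]
```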

Health Check

  • Last commit: 2 months ago
  • Responsiveness: 1 week
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 10 stars in the last 30 days

Starred by Jason Knight (Director AI Compilers at NVIDIA; Cofounder of OctoML), Travis Fischer (Founder of Agentic), and 5 more.

Explore Similar Projects

fromage by kohjingyu

0%
482 stars
Multimodal model for grounding language models to images
Created 2 years ago
Updated 1 year ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems") and Omar Sanseviero (DevRel at Google DeepMind).

gill by kohjingyu

0%
463 stars
Multimodal LLM for generating/retrieving images and generating text
Created 2 years ago
Updated 1 year ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Yaowei Zheng (Author of LLaMA-Factory), and 1 more.

CLIP_prefix_caption by rmokady

0.1%
1k stars
Image captioning model using CLIP embeddings as a prefix
Created 4 years ago
Updated 1 year ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems") and Elvis Saravia (Founder of DAIR.AI).

NExT-GPT by NExT-GPT

0.1%
4k stars
Any-to-any multimodal LLM research paper
Created 2 years ago
Updated 4 months ago