Research paper code and models for improving CLIP training via language rewrites
This repository provides code, text data, and pre-trained models for LaCLIP, a method that enhances CLIP training by using large language models to rewrite text descriptions. It targets researchers and practitioners in computer vision and natural language processing who want to improve vision-language model performance. By training on LLM-generated text augmentations, LaCLIP improves zero-shot classification accuracy over standard CLIP training.
How It Works
LaCLIP employs a two-stage process. First, it generates "meta-input-output pairs" using LLMs (like ChatGPT or Bard) to create diverse text rewrites for image captions. Second, it uses these pairs as in-context learning examples for LLaMA, enabling LLaMA to generate rewritten captions for large image-text datasets. This augmented dataset is then used to train CLIP models, leading to improved zero-shot capabilities.
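As a rough illustration of the in-context rewriting step, the sketch below assembles a few-shot prompt from example rewrite pairs; the pairs, prompt format, and function name are illustrative assumptions, not taken from llama_rewrite.py.

```python
# Sketch of how meta input-output pairs could be turned into a few-shot
# rewriting prompt for LLaMA. The example pairs and "=>" template are
# illustrative placeholders, not the repository's actual prompt format.
META_PAIRS = [
    ("a dog sitting on a bench",
     "a small brown dog rests on a wooden park bench"),
    ("a plate of pasta with sauce",
     "a close-up of spaghetti topped with rich tomato sauce"),
]

def build_rewrite_prompt(caption: str) -> str:
    """Assemble a few-shot prompt that ends with the caption to be rewritten."""
    lines = [f"{src} => {rewrite}" for src, rewrite in META_PAIRS]
    lines.append(f"{caption} =>")
    return "\n".join(lines)

prompt = build_rewrite_prompt("two surfers riding a large wave")
# The prompt would then be sent to a LLaMA completion call, and the generated
# continuation used as an augmented caption for CLIP training.
print(prompt)
```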
Quick Start & Requirements
Text rewriting requires access to the LLaMA model weights; rewritten captions are generated with llama_rewrite.py.
Zero-shot evaluation uses eval_zeroshot_imagenet.py (or eval_zeroshot_imagenet_laion.py) and requires access to the ImageNet dataset; a minimal evaluation sketch follows this list.
Training is built on open_clip; the torchrun training command specifies the training data, image roots, and augmented caption files.
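For orientation, here is a minimal zero-shot classification sketch built on open_clip; it is not the repository's eval_zeroshot_imagenet.py script, and the model name, checkpoint path, image file, and class names are placeholders assumed for the example.

```python
# Minimal zero-shot classification sketch with open_clip (not the repo's
# eval_zeroshot_imagenet.py). Model name, checkpoint path, image path, and
# class names are placeholders.
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="/path/to/laclip_checkpoint.pt"  # hypothetical checkpoint
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

class_names = ["goldfish", "tabby cat", "golden retriever"]  # placeholder label set
text = tokenizer([f"a photo of a {name}" for name in class_names])

with torch.no_grad():
    text_features = model.encode_text(text)
    text_features /= text_features.norm(dim=-1, keepdim=True)

    image = preprocess(Image.open("example.jpg")).unsqueeze(0)
    image_features = model.encode_image(image)
    image_features /= image_features.norm(dim=-1, keepdim=True)

    # Cosine similarity between the image and each class prompt, softmaxed.
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(dict(zip(class_names, probs[0].tolist())))
```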
Highlighted Details
Maintenance & Community
The work was published at NeurIPS 2023 and is authored by Lijie Fan, Dilip Krishnan, Phillip Isola, Dina Katabi, and Yonglong Tian. No specific community channels or roadmap are mentioned in the README.
Licensing & Compatibility
The repository's license is not explicitly stated in the README. Compatibility for commercial use or closed-source linking would require clarification of the licensing terms.
Limitations & Caveats
Access to the LLaMA model weights, and the setup required to run text rewriting with them, can be a significant barrier. The README also notes that the order of samples in the augmented text files must exactly match the order in the training data files (a simple length check is sketched below).
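As a minimal guard against such misalignment, assuming one caption per line in each file (a hypothetical layout, not necessarily the repository's format), one can at least check that the original and augmented caption files contain the same number of entries; this verifies counts only, not ordering.

```python
# Hypothetical sanity check: an augmented caption file must stay aligned with
# the original caption file. Assumes one caption per line; only verifies that
# the line counts agree, not that the ordering itself is correct.
def assert_caption_files_aligned(original_path: str, augmented_path: str) -> None:
    with open(original_path) as f:
        n_original = sum(1 for _ in f)
    with open(augmented_path) as f:
        n_augmented = sum(1 for _ in f)
    if n_original != n_augmented:
        raise ValueError(
            f"{augmented_path} has {n_augmented} captions but "
            f"{original_path} has {n_original}; sample count and order must match."
        )

assert_caption_files_aligned("captions_original.txt", "captions_rewritten.txt")
```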