LaCLIP by LijieFan

Research paper code and models for improving CLIP training via language rewrites

Created 2 years ago
286 stars

Top 91.6% on SourcePulse

View on GitHub
Project Summary

This repository provides code, text data, and pre-trained models for LaCLIP, a method that enhances CLIP training by using large language models to rewrite text descriptions. It targets researchers and practitioners in computer vision and natural language processing looking to improve vision-language model performance. LaCLIP achieves state-of-the-art zero-shot classification results by leveraging LLM-generated text augmentations.

How It Works

LaCLIP employs a two-stage process. First, it uses LLMs (such as ChatGPT or Bard) to produce a small set of "meta-input-output pairs": example captions paired with diverse rewrites. Second, these pairs serve as in-context learning examples for LLaMA, which then generates rewritten captions at scale for large image-text datasets. The augmented dataset is used to train CLIP models, yielding improved zero-shot capabilities.
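To make the second stage concrete, here is a minimal sketch of how meta-input-output pairs can be assembled into a few-shot rewrite prompt. The example pairs, prompt layout, and function name are illustrative assumptions; the repository's actual format lives in llama_rewrite.py.

```python
# Hypothetical few-shot prompt construction for caption rewriting.
# META_PAIRS stands in for LLM-generated meta-input-output pairs;
# the real pairs and prompt layout come from the repository.
META_PAIRS = [
    ("a photo of a dog on the beach",
     "a dog playing on a sandy beach under a clear sky"),
    ("red car parked on street",
     "a bright red car parked along a quiet city street"),
]

def build_rewrite_prompt(caption: str) -> str:
    """Turn each meta pair into one demonstration and leave the new
    caption's output open for LLaMA to complete."""
    demos = "".join(f"Input: {src}\nOutput: {dst}\n\n" for src, dst in META_PAIRS)
    return demos + f"Input: {caption}\nOutput:"

print(build_rewrite_prompt("a cat sitting on a windowsill"))
```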

Quick Start & Requirements

  • Rewrite Generation: requires LLaMA weights plus PyTorch (>=1.11.0), torchvision (>=0.12.0), and timm (>=0.5.4). Set the environment variables for the LLaMA paths, then run llama_rewrite.py.
  • Zero-Shot Evaluation: run eval_zeroshot_imagenet.py or eval_zeroshot_imagenet_laion.py with access to the ImageNet dataset; a minimal sketch of the protocol follows this list.
  • Training: requires PyTorch, torchvision, timm, and optionally open_clip. Launch with torchrun, specifying the training data, image roots, and augmented caption files.
  • Dependencies: LLaMA model access and setup are a hard prerequisite for rewrite generation.
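For reference, the sketch below shows the generic CLIP zero-shot classification protocol using open_clip. The repository's eval_zeroshot_imagenet.py implements its own variant (checkpoint loading, the full ImageNet prompt-template set); the class names and image path here are illustrative stand-ins.

```python
# Generic zero-shot classification sketch with open_clip (not the repo's script).
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion400m_e32"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")

# Stand-in class names; a real run would use the 1000 ImageNet classes.
class_names = ["goldfish", "tabby cat", "golden retriever"]
text = tokenizer([f"a photo of a {c}" for c in class_names])

with torch.no_grad():
    # Encode and L2-normalize the class prompts once.
    text_features = model.encode_text(text)
    text_features /= text_features.norm(dim=-1, keepdim=True)

    # Encode one image and pick the class with the highest cosine similarity.
    image = preprocess(Image.open("example.jpg")).unsqueeze(0)
    image_features = model.encode_image(image)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    pred = (image_features @ text_features.T).argmax(dim=-1)

print(class_names[pred.item()])
```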

Highlighted Details

  • Achieves significant zero-shot gains on ImageNet over vanilla CLIP when trained on each of CC3M, CC12M, RedCaps, and LAION-400M.
  • Offers pre-computed augmented texts for CC3M, CC12M, and RedCaps datasets.
  • Provides code for generating rewrites, zero-shot evaluation, and training LaCLIP models.
  • Includes pre-trained models for both LaCLIP and vanilla CLIP on multiple datasets.

Maintenance & Community

The project is associated with NeurIPS 2023 and authored by Lijie Fan, Dilip Krishnan, Phillip Isola, Dina Katabi, and Yonglong Tian. No specific community channels or roadmap are mentioned in the README.

Licensing & Compatibility

The repository's license is not explicitly stated in the README. Compatibility for commercial use or closed-source linking would require clarification of the licensing terms.

Limitations & Caveats

Access to LLaMA weights and the associated setup is required for text rewriting, which can be a significant barrier. The README also notes that the sample order in the augmented text files must exactly match the order in the training data files; a quick alignment sanity check is sketched below.
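The snippet below assumes, as a loud assumption, that both files store one caption per line in the same row order; the repository's actual file formats and names may differ.

```python
# Assumption: one caption per line, rows aligned between the two files.
# This only catches row-count mismatches, the most common alignment failure.
def check_alignment(original_path: str, augmented_path: str) -> None:
    with open(original_path) as f_orig, open(augmented_path) as f_aug:
        n_orig = sum(1 for _ in f_orig)
        n_aug = sum(1 for _ in f_aug)
    if n_orig != n_aug:
        raise ValueError(
            f"Row count mismatch: {n_orig} original captions vs {n_aug} augmented"
        )
    print(f"OK: {n_orig} rows in both files")

# Hypothetical file names for illustration.
check_alignment("cc3m_train_captions.txt", "cc3m_augmented_captions.txt")
```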

Health Check

Last Commit: 1 year ago
Responsiveness: Inactive
Pull Requests (30d): 0
Issues (30d): 0
Star History: 3 stars in the last 30 days

Explore Similar Projects

Starred by Jiayi Pan (Author of SWE-Gym; MTS at xAI), Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), and 1 more.

METER by zdou0830

373 stars
Multimodal framework for vision-and-language transformer research
Created 3 years ago
Updated 2 years ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Wing Lian (Founder of Axolotl AI), and 10 more.

open_flamingo by mlfoundations

4k stars
Open-source framework for training large multimodal models
Created 2 years ago
Updated 1 year ago