LaCLIP by LijieFan

Research paper code and models for improving CLIP training via language rewrites

created 2 years ago
283 stars

Top 93.3% on sourcepulse

View on GitHub
Project Summary

This repository provides code, text data, and pre-trained models for LaCLIP, a method that enhances CLIP training by using large language models to rewrite text descriptions. It targets researchers and practitioners in computer vision and natural language processing looking to improve vision-language model performance. LaCLIP achieves state-of-the-art zero-shot classification results by leveraging LLM-generated text augmentations.

How It Works

LaCLIP employs a two-stage process. First, strong LLMs (such as ChatGPT or Bard) produce a small set of "meta-input-output pairs": example caption rewrites that demonstrate the desired augmentation. Second, these pairs serve as in-context learning examples for LLaMA, which then rewrites captions at scale across large image-text datasets. The augmented dataset is used to train CLIP models, yielding improved zero-shot capabilities.
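For intuition, here is a minimal sketch of the in-context prompting idea. The prompt format, the `META_PAIRS` contents, and the `generate()` wrapper are illustrative assumptions, not the repository's actual `llama_rewrite.py` logic:

```python
# Sketch of in-context caption rewriting. `generate(prompt)` stands in
# for a LLaMA text-completion call; all names here are illustrative.

META_PAIRS = [
    # (original caption, LLM-written rewrite) -- example pairs only
    ("a dog runs on the beach",
     "A joyful dog sprints across the sandy beach under a clear sky."),
    ("red car parked outside",
     "A bright red car sits parked on the street in front of a building."),
]

def build_prompt(caption: str) -> str:
    """Concatenate meta pairs as few-shot examples, then append the new caption."""
    blocks = [f"Input: {src}\nOutput: {tgt}" for src, tgt in META_PAIRS]
    blocks.append(f"Input: {caption}\nOutput:")
    return "\n\n".join(blocks)

# rewritten = generate(build_prompt("a cat sleeping on a sofa"))
```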

Quick Start & Requirements

  • Rewrite Generation: Requires LLaMA weights, PyTorch (>=1.11.0), torchvision (>=0.12.0), and timm (>=0.5.4). Generation is driven by llama_rewrite.py, with LLaMA paths supplied via environment variables.
  • Zero-Shot Evaluation: Uses Python scripts (eval_zeroshot_imagenet.py or eval_zeroshot_imagenet_laion.py) with ImageNet dataset access; see the sketch after this list.
  • Training: Requires PyTorch, torchvision, timm, and optionally open_clip. Training is launched via torchrun, specifying the training data, image roots, and augmented caption files.
  • Dependencies: LLaMA model access and setup are required for rewrite generation.
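For orientation, a minimal zero-shot classification sketch using open_clip. The model tag, prompt template, class list, and image path are illustrative; the repository's own evaluation scripts handle the full ImageNet protocol and load LaCLIP checkpoints rather than the pretrained tag used here:

```python
import torch
import open_clip
from PIL import Image

# Illustrative setup: a stock open_clip model stands in for a LaCLIP checkpoint.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion400m_e32")
tokenizer = open_clip.get_tokenizer("ViT-B-32")

classes = ["dog", "cat", "car"]  # stand-in for the 1000 ImageNet classes
text = tokenizer([f"a photo of a {c}" for c in classes])
image = preprocess(Image.open("example.jpg")).unsqueeze(0)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # Cosine similarity between the image and each class prompt
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(classes[probs.argmax().item()])
```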

Highlighted Details

  • Achieves significant zero-shot ImageNet gains over vanilla CLIP when pre-trained on CC3M, CC12M, RedCaps, or LAION-400M.
  • Offers pre-computed augmented texts for CC3M, CC12M, and RedCaps datasets.
  • Provides code for generating rewrites, zero-shot evaluation, and training LaCLIP models.
  • Includes pre-trained models for both LaCLIP and vanilla CLIP on multiple datasets.

Maintenance & Community

The project accompanies a NeurIPS 2023 paper by Lijie Fan, Dilip Krishnan, Phillip Isola, Dina Katabi, and Yonglong Tian. No community channels or roadmap are mentioned in the README.

Licensing & Compatibility

The repository's license is not explicitly stated in the README. Compatibility for commercial use or closed-source linking would require clarification of the licensing terms.

Limitations & Caveats

Access to LLaMA weights is required for text rewriting, which can be a significant barrier. The README also notes that the order of samples in each augmented text file must exactly match the order of samples in the corresponding training data file.
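A minimal sketch of what that ordering constraint implies when pairing samples with rewrites during training. The file names and the one-caption-per-line format are assumptions, not the repository's actual loader:

```python
import random

# Illustrative files: one caption per line, identical sample order in both.
with open("train_captions.txt") as f:
    original = [line.strip() for line in f]
with open("augmented_captions.txt") as f:
    augmented = [line.strip() for line in f]

# Captions are paired purely by position, so a mismatch silently
# assigns rewrites to the wrong images.
assert len(original) == len(augmented), "files must align line-by-line"

def sample_caption(idx: int) -> str:
    """Randomly pick the original or rewritten caption for image idx."""
    return random.choice([original[idx], augmented[idx]])
```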

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1+ week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 6 stars in the last 90 days

Explore Similar Projects

Starred by Stas Bekman (Author of Machine Learning Engineering Open Book; Research Engineer at Snowflake), Douwe Kiela (Cofounder of Contextual AI), and 1 more.

lens by ContextualAI

0% · 352 stars
Vision-language research paper using LLMs
created 2 years ago · updated 1 week ago
Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Omar Sanseviero (DevRel at Google DeepMind), and 4 more.

open_flamingo by mlfoundations

0.1% · 4k stars
Open-source framework for training large multimodal models
created 2 years ago · updated 11 months ago