LaCLIP by LijieFan

Research paper code and models for improving CLIP training via language rewrites

Created 2 years ago
286 stars

Top 91.6% on SourcePulse

View on GitHub
Project Summary

This repository provides code, text data, and pre-trained models for LaCLIP, a method that enhances CLIP training by using large language models to rewrite text descriptions. It targets researchers and practitioners in computer vision and natural language processing looking to improve vision-language model performance. LaCLIP achieves state-of-the-art zero-shot classification results by leveraging LLM-generated text augmentations.

How It Works

LaCLIP employs a two-stage process. First, it uses LLMs (such as ChatGPT or Bard) to produce a small set of "meta-input-output pairs": example captions paired with diverse rewrites. Second, these pairs serve as in-context learning examples for LLaMA, which then generates rewritten captions at scale for large image-text datasets. The augmented dataset is used to train CLIP models, yielding improved zero-shot capabilities.
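To make the second stage concrete, here is a minimal sketch of how meta-input-output pairs can be assembled into a few-shot rewrite prompt. The example pairs, prompt layout, and function name are illustrative assumptions; the repository's actual format lives in llama_rewrite.py.

```python
# Hypothetical few-shot prompt construction for caption rewriting.
# META_PAIRS stands in for LLM-generated meta-input-output pairs;
# the real pairs and prompt layout come from the repository.
META_PAIRS = [
    ("a photo of a dog on the beach",
     "a dog playing on a sandy beach under a clear sky"),
    ("red car parked on street",
     "a bright red car parked along a quiet city street"),
]

def build_rewrite_prompt(caption: str) -> str:
    """Turn each meta pair into one demonstration and leave the new
    caption's output open for LLaMA to complete."""
    demos = "".join(f"Input: {src}\nOutput: {dst}\n\n" for src, dst in META_PAIRS)
    return demos + f"Input: {caption}\nOutput:"

print(build_rewrite_prompt("a cat sitting on a windowsill"))
```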

Quick Start & Requirements

  • Rewrite Generation: requires LLaMA weights plus PyTorch (>=1.11.0), torchvision (>=0.12.0), and timm (>=0.5.4). Set the environment variables for the LLaMA paths, then run llama_rewrite.py.
  • Zero-Shot Evaluation: run eval_zeroshot_imagenet.py or eval_zeroshot_imagenet_laion.py with access to the ImageNet dataset; a minimal sketch of the protocol follows this list.
  • Training: requires PyTorch, torchvision, timm, and optionally open_clip. Launch with torchrun, specifying the training data, image roots, and augmented caption files.
  • Dependencies: LLaMA model access and setup are a hard prerequisite for rewrite generation.
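For reference, the sketch below shows the generic CLIP zero-shot classification protocol using open_clip. The repository's eval_zeroshot_imagenet.py implements its own variant (checkpoint loading, the full ImageNet prompt-template set); the class names and image path here are illustrative stand-ins.

```python
# Generic zero-shot classification sketch with open_clip (not the repo's script).
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion400m_e32"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")

# Stand-in class names; a real run would use the 1000 ImageNet classes.
class_names = ["goldfish", "tabby cat", "golden retriever"]
text = tokenizer([f"a photo of a {c}" for c in class_names])

with torch.no_grad():
    # Encode and L2-normalize the class prompts once.
    text_features = model.encode_text(text)
    text_features /= text_features.norm(dim=-1, keepdim=True)

    # Encode one image and pick the class with the highest cosine similarity.
    image = preprocess(Image.open("example.jpg")).unsqueeze(0)
    image_features = model.encode_image(image)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    pred = (image_features @ text_features.T).argmax(dim=-1)

print(class_names[pred.item()])
```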

Highlighted Details

  • Achieves significant zero-shot gains on ImageNet over vanilla CLIP when trained on each of CC3M, CC12M, RedCaps, and LAION-400M.
  • Offers pre-computed augmented texts for CC3M, CC12M, and RedCaps datasets.
  • Provides code for generating rewrites, zero-shot evaluation, and training LaCLIP models.
  • Includes pre-trained models for both LaCLIP and vanilla CLIP on multiple datasets.

Maintenance & Community

The project is associated with NeurIPS 2023 and authored by Lijie Fan, Dilip Krishnan, Phillip Isola, Dina Katabi, and Yonglong Tian. No specific community channels or roadmap are mentioned in the README.

Licensing & Compatibility

The repository's license is not explicitly stated in the README. Compatibility for commercial use or closed-source linking would require clarification of the licensing terms.

Limitations & Caveats

Access to LLaMA weights and the associated setup is required for text rewriting, which can be a significant barrier. The README also notes that the sample order in the augmented text files must exactly match the order in the training data files; a quick alignment sanity check is sketched below.
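The snippet below assumes, as a loud assumption, that both files store one caption per line in the same row order; the repository's actual file formats and names may differ.

```python
# Assumption: one caption per line, rows aligned between the two files.
# This only catches row-count mismatches, the most common alignment failure.
def check_alignment(original_path: str, augmented_path: str) -> None:
    with open(original_path) as f_orig, open(augmented_path) as f_aug:
        n_orig = sum(1 for _ in f_orig)
        n_aug = sum(1 for _ in f_aug)
    if n_orig != n_aug:
        raise ValueError(
            f"Row count mismatch: {n_orig} original captions vs {n_aug} augmented"
        )
    print(f"OK: {n_orig} rows in both files")

# Hypothetical file names for illustration.
check_alignment("cc3m_train_captions.txt", "cc3m_augmented_captions.txt")
```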

Health Check

Last Commit: 1 year ago
Responsiveness: Inactive
Pull Requests (30d): 0
Issues (30d): 0
Star History: 3 stars in the last 30 days

Explore Similar Projects

Starred by Jiayi Pan (Author of SWE-Gym; MTS at xAI), Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), and 1 more.

METER by zdou0830

373 stars
Multimodal framework for vision-and-language transformer research
Created 3 years ago
Updated 2 years ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Wing Lian (Founder of Axolotl AI), and 10 more.

open_flamingo by mlfoundations

4k stars
Open-source framework for training large multimodal models
Created 2 years ago
Updated 1 year ago