vision-language-models-are-bows by mertyg

Research paper experiments and data for vision-language models

Created 2 years ago · 280 stars · Top 93.9% on sourcepulse

View on GitHub
Project Summary

This repository provides experimental code and datasets for the ICLR 2023 Oral paper "When and why vision-language models behave like bags-of-words, and what to do about it?". It targets researchers and practitioners in vision-language modeling who want to understand and mitigate the "bag-of-words" phenomenon in these models, offering tools to reproduce findings and analyze model behavior.

How It Works

The project introduces four datasets (VG-Relation, VG-Attribution, COCO-Order, Flickr30k-Order) that probe whether vision-language models are sensitive to relations, attribution, and word order, or instead rely on simple word co-occurrence. It also includes implementations and interfaces for several VLM architectures (BLIP, CLIP, Flava, XVLM) and a modified training script for NegCLIP, which demonstrates how to train models that are less susceptible to the bag-of-words bias.
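To make the probing idea concrete, here is a minimal sketch (not from the repository) using OpenAI's clip package: a model that behaves like a bag of words scores a randomly shuffled caption about as highly as the original. The image path and caption are placeholders.

```python
import random

import torch
import clip  # pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Placeholder image and caption -- substitute your own.
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
caption = "the horse is eating the grass"
words = caption.split()
random.shuffle(words)
texts = clip.tokenize([caption, " ".join(words)]).to(device)

with torch.no_grad():
    image_feat = model.encode_image(image)
    text_feat = model.encode_text(texts)
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    sims = (image_feat @ text_feat.T).squeeze(0)

# A bag-of-words model assigns near-identical scores to both captions.
print(f"original: {sims[0]:.4f}  shuffled: {sims[1]:.4f}")
```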

Quick Start & Requirements

  • Install via pip (requires PyTorch, CLIP, etc.).
  • Requires CUDA for GPU acceleration.
  • Datasets are ~1GB and can be downloaded via the download=True flag (see the sketch after this list).
  • Notebooks for reproducing the experiments are available in notebooks/.
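A minimal loading sketch, assuming the dataset_zoo module and the VG_Relation / VG_Attribution classes shown in the repo's README; root_dir is a placeholder for the download location:

```python
# Sketch based on the dataset_zoo interface shown in the README;
# exact signatures may differ -- verify against the current code.
from dataset_zoo import VG_Relation, VG_Attribution

root_dir = "./data"  # placeholder download location

# download=True fetches the archives (~1GB total) on first use.
vgr = VG_Relation(image_preprocess=None, download=True, root_dir=root_dir)
vga = VG_Attribution(image_preprocess=None, download=True, root_dir=root_dir)

# Each item pairs an image with a correct caption and a perturbed one.
print(len(vgr), len(vga))
```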

Highlighted Details

  • Investigates when and why VLMs behave like bags-of-words.
  • Introduces the ARO benchmark (Attribution, Relation, and Order), comprising the VG-Relation, VG-Attribution, COCO-Order, and Flickr30k-Order datasets.
  • Provides interfaces for BLIP, CLIP, Flava, XVLM, and NegCLIP (see the sketch after this list).
  • Includes a script to reproduce NegCLIP training on a single GPU.
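The model interfaces share a common entry point. A sketch, assuming the model_zoo.get_model helper and the model-name string described in the README:

```python
# Sketch of the model_zoo interface described in the README; the helper
# name and model identifier are assumptions -- verify against the repo.
from model_zoo import get_model

# Returns a model wrapper plus its matching image-preprocessing function.
model, preprocess = get_model(model_name="openai-clip:ViT-B/32", device="cuda")
# Identifiers for BLIP, Flava, XVLM, and NegCLIP are exposed the same way.
```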

Maintenance & Community

  • The project is associated with the ICLR 2023 paper.
  • Contact: merty@stanford.edu.
  • TODOs include adding distributed-training support and negative-generation code.

Licensing & Compatibility

  • The repository itself does not explicitly state a license.
  • It heavily relies on and integrates code from other repositories (BLIP, CLIP, Flava, OpenCLIP, XVLM), which have their own licenses. Users must adhere to the licenses of these underlying projects.

Limitations & Caveats

The NegCLIP training script currently supports only single-GPU execution, and the authors note that the code release and the camera-ready version of the paper were delayed by unforeseen circumstances.

Health Check

  • Last commit: 2 years ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 4 stars in the last 90 days

Starred by Chip Huyen (author of AI Engineering and Designing Machine Learning Systems), Omar Sanseviero (DevRel at Google DeepMind), and 4 more.

Explore Similar Projects

open_flamingo by mlfoundations

Open-source framework for training large multimodal models

  • Top 0.1% · 4k stars
  • Created 2 years ago · updated 11 months ago