Research paper experiments and data for vision-language models
This repository provides experimental code and datasets for the ICLR 2023 Oral paper "When and why vision-language models behave like bags-of-words, and what to do about it?". It targets researchers and practitioners in vision-language modeling who want to understand and mitigate the "bag-of-words" phenomenon in these models, offering tools to reproduce findings and analyze model behavior.
How It Works
The project introduces novel datasets (VG-Relation, VG-Attribution, COCO-Order, Flickr30k-Order) designed to probe vision-language models for their reliance on simple word co-occurrence rather than genuine visual understanding. It also includes implementations and interfaces for various VLM architectures (BLIP, CLIP, Flava, XVLM) and a modified training script for NegCLIP, demonstrating how to train models to be less susceptible to the bag-of-words bias.
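To make the probing idea concrete, here is a minimal, self-contained sketch (not taken from the repository) that uses the standard OpenAI `clip` package to score an image against a correct caption and a word-order-shuffled copy of it. The image path `example.jpg` is a placeholder; if the two scores come out nearly equal, the model is effectively treating the caption as a bag of words.

```python
import random

import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Placeholder image path; substitute any test image.
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)

caption = "the horse is eating the grass"
words = caption.split()
shuffled = " ".join(random.sample(words, len(words)))  # scrambled word order

text = clip.tokenize([caption, shuffled]).to(device)
with torch.no_grad():
    logits_per_image, _ = model(image, text)
    probs = logits_per_image.softmax(dim=-1)[0].tolist()

# A model with genuine compositional understanding should prefer the intact caption;
# near-equal scores indicate bag-of-words behavior.
print(dict(zip([caption, shuffled], probs)))
```

The repository's datasets systematize this kind of perturbation in controlled ways, swapping relations, attributes, or word order rather than shuffling at random.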
Quick Start & Requirements
- Install dependencies via `pip` (requires PyTorch, CLIP, etc.).
- Use a `cuda` device for GPU acceleration.
- Pass the `download=True` flag to fetch the benchmark datasets automatically (see the sketch below).
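Below is a hedged sketch of how the download flag might be used, based on the dataset interface described in the repository (a `dataset_zoo` module exposing classes such as `VG_Relation`); the exact module, class, and argument names should be verified against the repo's README, and `root_dir` is a hypothetical path.

```python
import clip
from dataset_zoo import VG_Relation  # assumed module/class names from the repository

# Any model with a compatible preprocessing function can be evaluated; plain CLIP shown here.
model, image_preprocess = clip.load("ViT-B/32", device="cuda")

root_dir = "/path/to/aro/datasets"  # hypothetical download/cache location
# download=True is assumed to fetch images and annotations if they are not already present.
vgr_dataset = VG_Relation(image_preprocess=image_preprocess, download=True, root_dir=root_dir)

# Per the repository's description, each item pairs one image with a true caption
# and a relation-swapped distractor caption (structure assumed here).
item = vgr_dataset[0]
print(item["caption_options"])
```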
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The NegCLIP training script currently supports only single-GPU execution, and the authors note that the code release and camera-ready version were delayed by unforeseen circumstances.