Research paper experiments and data for vision-language models
This repository provides experimental code and datasets for the ICLR 2023 Oral paper "When and why vision-language models behave like bags-of-words, and what to do about it?". It targets researchers and practitioners in vision-language modeling who want to understand and mitigate the "bag-of-words" phenomenon in these models, offering tools to reproduce findings and analyze model behavior.
How It Works
The project introduces novel datasets (VG-Relation, VG-Attribution, COCO-Order, Flickr30k-Order) designed to probe vision-language models for their reliance on simple word co-occurrence rather than genuine visual understanding. It also includes implementations and interfaces for various VLM architectures (BLIP, CLIP, Flava, XVLM) and a modified training script for NegCLIP, demonstrating how to train models to be less susceptible to the bag-of-words bias.
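To make the probing idea concrete, here is a minimal, self-contained sketch (not taken from the repository) that uses the standard OpenAI `clip` package to score an image against a correct caption and a word-order-shuffled copy of it. The image path `example.jpg` is a placeholder; if the two scores come out nearly equal, the model is effectively treating the caption as a bag of words.

```python
import random

import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Placeholder image path; substitute any test image.
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)

caption = "the horse is eating the grass"
words = caption.split()
shuffled = " ".join(random.sample(words, len(words)))  # scrambled word order

text = clip.tokenize([caption, shuffled]).to(device)
with torch.no_grad():
    logits_per_image, _ = model(image, text)
    probs = logits_per_image.softmax(dim=-1)[0].tolist()

# A model with genuine compositional understanding should prefer the intact caption;
# near-equal scores indicate bag-of-words behavior.
print(dict(zip([caption, shuffled], probs)))
```

The repository's datasets systematize this kind of perturbation in controlled ways, swapping relations, attributes, or word order rather than shuffling at random.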
Quick Start & Requirements
- Install dependencies via `pip` (requires PyTorch, CLIP, etc.).
- Use a `cuda` device for GPU acceleration.
- Pass the `download=True` flag to fetch the benchmark datasets automatically (see the sketch below).
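Below is a hedged sketch of how the download flag might be used, based on the dataset interface described in the repository (a `dataset_zoo` module exposing classes such as `VG_Relation`); the exact module, class, and argument names should be verified against the repo's README, and `root_dir` is a hypothetical path.

```python
import clip
from dataset_zoo import VG_Relation  # assumed module/class names from the repository

# Any model with a compatible preprocessing function can be evaluated; plain CLIP shown here.
model, image_preprocess = clip.load("ViT-B/32", device="cuda")

root_dir = "/path/to/aro/datasets"  # hypothetical download/cache location
# download=True is assumed to fetch images and annotations if they are not already present.
vgr_dataset = VG_Relation(image_preprocess=image_preprocess, download=True, root_dir=root_dir)

# Per the repository's description, each item pairs one image with a true caption
# and a relation-swapped distractor caption (structure assumed here).
item = vgr_dataset[0]
print(item["caption_options"])
```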
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The NegCLIP training script currently supports only single-GPU execution, and the authors note that the code release and camera-ready version were delayed by unforeseen circumstances.