ArGue by xytian1008

Attribute-guided prompt tuning for vision-language models

Created 2 years ago
395 stars

Top 72.7% on SourcePulse

Project Summary

ArGue enhances soft prompt tuning for Vision-Language Models (VLMs) by mitigating distribution shift and spurious correlations. It targets researchers and practitioners seeking improved performance in novel class prediction and out-of-distribution generalization tasks, offering a method to align VLMs more robustly with visual concepts.

How It Works

ArGue introduces three core components to soft prompt tuning. "Attribute-Guided Prompting" augments prompts with LLM-generated visual attributes ([soft tokens] + [class name] + [attribute]). "Attribute Sampling" refines this by clustering attributes semantically and selecting the most visually relevant ones (N=3 per class) based on CLIP text features and training image similarity, significantly reducing computational overhead while filtering irrelevant attributes. "Negative Prompting" (ArGue-N) further suppresses spurious correlations, particularly background cues, by training the model to output uniform distributions under specifically crafted negative prompts.
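The three components described above can be illustrated with a minimal sketch. It is not the repository's code: the function names (build_prompts, select_attributes, uniform_target_loss) are hypothetical, plain Python lists stand in for CLIP text and image features, the "X" string is only a placeholder for a learnable soft token, and cluster assignments are assumed to have been computed beforehand. The selection rule (keep the attribute most similar to the training images in each cluster) and the loss (cross-entropy against a uniform target) are one plausible reading of the summary.

```python
import math


def build_prompts(class_name, attributes, n_soft_tokens=4):
    """Compose '[soft tokens] + [class name] + [attribute]' strings.

    "X" is a stand-in for a learnable soft token (assumption for illustration).
    """
    soft = " ".join(["X"] * n_soft_tokens)
    return [f"{soft} {class_name} {attr}" for attr in attributes]


def cosine(u, v):
    """Cosine similarity between two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def select_attributes(attr_feats, cluster_ids, image_feats):
    """Keep one attribute per cluster: the one whose text feature has the
    highest mean cosine similarity to the training image features.

    cluster_ids[i] is the (precomputed, assumed) cluster of attribute i.
    Returns the sorted indices of the selected attributes.
    """
    best = {}  # cluster id -> (score, attribute index)
    for idx, (feat, cid) in enumerate(zip(attr_feats, cluster_ids)):
        score = sum(cosine(feat, img) for img in image_feats) / len(image_feats)
        if cid not in best or score > best[cid][0]:
            best[cid] = (score, idx)
    return sorted(idx for _, idx in best.values())


def uniform_target_loss(logits):
    """Cross-entropy between softmax(logits) and a uniform distribution.

    The loss is smallest (log of the number of classes) when the
    prediction under a negative prompt is exactly uniform.
    """
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    log_probs = [x - log_z for x in logits]
    return -sum(log_probs) / len(logits)


if __name__ == "__main__":
    print(build_prompts("cat", ["with whiskers", "with pointed ears"], 2))
    feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
    print(select_attributes(feats, [0, 0, 1], [[1.0, 0.0]]))
    print(uniform_target_loss([0.0, 0.0, 0.0]), uniform_target_loss([2.0, 0.0, 0.0]))
```

With three clusters per class, select_attributes would return three indices, matching the N=3 setting mentioned above. In a real training loop the same quantities would be computed on tensors rather than lists.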

Quick Start & Requirements

  • Installation: Clone the repository, navigate to the directory, and run pip install -r requirements.txt. Install the dassl library separately following its official instructions.
  • Prerequisites: Python ≥ 3.7, a CUDA-compatible GPU, and core libraries including torch, dassl, and clip. Attribute generation requires access to the GPT-3 API.
  • Dataset Preparation: Download and prepare datasets following the instructions provided by the CoOp project. Supported datasets include ImageNet, Caltech101, OxfordPets, and others.
  • Attribute Generation: Use python generate_descriptors.py to generate attributes via GPT-3.
  • Attribute Sampling: Execute bash scripts/ARGUE/select_attr.sh to cluster attributes and select representative ones.
  • Training/Evaluation: Scripts are provided for novel class prediction (base2new_train.sh, base2new_test.sh) and OOD generalization (xd_train.sh, xd_test.sh).
  • Links: Project Page, arXiv, GitHub

Highlighted Details

  • ArGue-N achieves state-of-the-art performance in novel class prediction, outperforming LASP by 1.70% in average harmonic mean across 11 datasets.
  • It is the first prompt tuning method to surpass zero-shot CLIP in novel class accuracy on 10 of 11 benchmarks.
  • It improves out-of-distribution generalization consistently across ImageNet variants, including a gain of 1.47% on ImageNet-A.
  • Each component (attribute guidance, sampling, and negative prompting) contributes incrementally to the performance gains.
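For readers unfamiliar with the harmonic mean metric cited above, the sketch below shows how it is conventionally computed in base-to-novel benchmarks, assuming it combines base-class and novel-class accuracy as in the CoOp/CoCoOp evaluation protocol. The function name and the input numbers are illustrative only and are not results from the paper.

```python
def harmonic_mean(base_acc, novel_acc):
    """Harmonic mean of base-class and novel-class accuracy (in percent).

    The metric penalizes a large gap between the two: a method that
    overfits to base classes scores lower than one that balances both.
    """
    if base_acc + novel_acc == 0:
        return 0.0
    return 2 * base_acc * novel_acc / (base_acc + novel_acc)


if __name__ == "__main__":
    # Illustrative accuracies, not figures reported for ArGue.
    print(round(harmonic_mean(80.0, 60.0), 2))
```

Because the harmonic mean is dominated by the smaller of the two values, gains in novel-class accuracy tend to move it more than equal gains in base-class accuracy when novel accuracy is the weaker side.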

Maintenance & Community

The project builds upon established frameworks like CoOp/CoCoOp and utilizes the Dassl training framework and CLIP backbone. Attribute generation relies on GPT-3. No specific community channels (e.g., Discord, Slack) or roadmap details are provided in the README.

Licensing & Compatibility

This project is licensed under the MIT License, which is permissive and generally compatible with commercial use and closed-source linking.

Limitations & Caveats

Attribute generation requires access to the GPT-3 API, which may incur costs. Dataset preparation relies on external instructions from the CoOp project. The code release follows the paper's acceptance at CVPR 2024.

Health Check

  • Last Commit: 1 month ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

  • 252 stars in the last 30 days
