ArGue by xytian1008

Attribute-guided prompt tuning for vision-language models

Created 2 years ago

588 stars

Top 54.6% on SourcePulse

Project Summary

ArGue enhances soft prompt tuning for Vision-Language Models (VLMs) by mitigating distribution shift and spurious correlations. It targets researchers and practitioners seeking improved performance in novel class prediction and out-of-distribution generalization tasks, offering a method to align VLMs more robustly with visual concepts.

How It Works

ArGue introduces three core components to soft prompt tuning:
Attribute-Guided Prompting: Augments prompts with LLM-generated visual attributes, in the form [soft tokens] + [class name] + [attribute].
Attribute Sampling: Refines this by clustering attributes semantically and selecting the most visually relevant ones (N=3 per class) based on CLIP text features and training image similarity. This filters out irrelevant attributes and reduces computational overhead.
Negative Prompting (ArGue-N): Further suppresses spurious correlations, particularly background cues, by training the model to output uniform distributions under specifically crafted negative prompts.
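The sampling and negative-prompting steps described above can be illustrated with a small NumPy sketch. It is not the repository's code and rests on several assumptions: attributes are clustered with plain k-means on their text features, one attribute is kept per cluster (so N clusters yield at most N attributes), relevance is the mean cosine similarity to the class's training image features, and the negative-prompt objective is a cross-entropy against a uniform target. All function names are hypothetical, and the feature arrays stand in for CLIP embeddings.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def kmeans_labels(points, k, iters=20, seed=0):
    """Plain k-means (an assumed choice); returns a cluster index per row."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)].copy()
    labels = np.zeros(len(points), dtype=int)
    for _ in range(iters):
        # Assign each point to its nearest center (squared Euclidean distance).
        dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the mean of its members; leave empty clusters as-is.
        for c in range(k):
            members = points[labels == c]
            if len(members) > 0:
                centers[c] = members.mean(axis=0)
    return labels


def pick_per_cluster(labels, relevance):
    """For each non-empty cluster, keep the index with the highest relevance."""
    chosen = []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        chosen.append(int(idx[np.argmax(relevance[idx])]))
    return sorted(chosen)


def select_attributes(text_feats, image_feats, n_select=3):
    """Cluster attribute text features, then keep one attribute per cluster.

    text_feats:  (num_attributes, dim) stand-in for CLIP text embeddings
    image_feats: (num_images, dim) stand-in for CLIP embeddings of one class
    """
    text_feats = l2_normalize(text_feats)
    image_feats = l2_normalize(image_feats)
    labels = kmeans_labels(text_feats, k=n_select)
    # Assumed relevance score: mean cosine similarity to the training images.
    relevance = (text_feats @ image_feats.T).mean(axis=1)
    return pick_per_cluster(labels, relevance)


def uniform_prediction_loss(logits):
    """Cross-entropy between a uniform target and softmax(logits), row-averaged.

    The loss is smallest, log(num_classes), when every row is uniform.
    """
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs.mean())
```

Calling `select_attributes(text_feats, image_feats)` returns the indices of the retained attributes for one class. `uniform_prediction_loss` would be applied to the class logits produced under a negative prompt, so that minimizing it pushes the prediction toward a uniform distribution.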

Quick Start & Requirements

Installation: Clone the repository, navigate to the directory, and run pip install -r requirements.txt. Install the dassl library separately following its official instructions.
Prerequisites: Python ≥ 3.7, a CUDA-compatible GPU, and core libraries including torch, dassl, and clip. Attribute generation requires access to the GPT-3 API.
Dataset Preparation: Download and prepare datasets following the instructions provided by the CoOp project. Supported datasets include ImageNet, Caltech101, OxfordPets, and others.
Attribute Generation: Use python generate_descriptors.py to generate attributes via GPT-3.
Attribute Sampling: Execute bash scripts/ARGUE/select_attr.sh to cluster attributes and select representative ones.
Training/Evaluation: Scripts are provided for novel class prediction (base2new_train.sh, base2new_test.sh) and OOD generalization (xd_train.sh, xd_test.sh).

Highlighted Details

ArGue-N achieves state-of-the-art performance in novel class prediction, outperforming LASP by 1.70% on average harmonic mean across 11 datasets.
It is the first prompt tuning method to surpass zero-shot CLIP on novel class accuracy in 10 out of 11 benchmarks.
It demonstrates consistent out-of-distribution generalization improvements across ImageNet variants, including a 1.47% gain on ImageNet-A.
Each component—attribute guidance, sampling, and negative prompting—contributes incrementally to performance gains.
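For readers unfamiliar with the harmonic mean metric cited above, the following sketch shows how it is usually computed in the base-to-new setting inherited from CoOp/CoCoOp, where it combines base-class and novel-class accuracy. This assumes ArGue follows that standard protocol. The function name and the numbers are illustrative only and are not results from the paper.

```python
def harmonic_mean(base_acc, novel_acc):
    """Harmonic mean of base-class and novel-class accuracy (both in percent)."""
    if base_acc + novel_acc == 0:
        return 0.0  # Avoid division by zero when both accuracies are zero.
    return 2 * base_acc * novel_acc / (base_acc + novel_acc)


# Illustrative values, not taken from the paper.
score = harmonic_mean(80.0, 70.0)
print(f"{score:.2f}")  # prints 74.67
```

Because the harmonic mean is pulled toward the lower of the two values, a method cannot score well by overfitting to base classes at the expense of novel ones.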

Maintenance & Community

The project builds upon established frameworks like CoOp/CoCoOp and utilizes the Dassl training framework and CLIP backbone. Attribute generation relies on GPT-3. No specific community channels (e.g., Discord, Slack) or roadmap details are provided in the README.

Licensing & Compatibility

This project is licensed under the MIT License, which is permissive and generally compatible with commercial use and closed-source linking.

Limitations & Caveats

Attribute generation requires access to the GPT-3 API, which may incur costs. Dataset preparation relies on external instructions from the CoOp project. The code release follows the paper's acceptance at CVPR 2024.
