Survey paper for vision-language model prompt engineering
This repository serves as a curated collection of research papers on prompt engineering for Vision-Language Models (VLMs). It targets researchers and practitioners in AI and computer vision, offering a structured overview of techniques for adapting VLMs across task families such as multimodal-to-text generation, image-text matching, and text-to-image synthesis. The primary benefit is a centralized, categorized resource for understanding the evolving landscape of VLM prompting.
How It Works
The repository categorizes prompting methods into "hard prompts" (task instructions, in-context learning, retrieval-based, chain-of-thought) and "soft prompts" (prompt tuning, prefix tuning), focusing on techniques that do not alter the base VLM architecture. It covers three main VLM types: multimodal-to-text generation (e.g., Flamingo), image-text matching (e.g., CLIP), and text-to-image generation (e.g., Stable Diffusion). The papers are organized by VLM type and prompting category, providing titles, venues, years, and code availability.
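To make the hard/soft distinction concrete, below is a minimal sketch of a hard prompt applied to an image-text matching model (CLIP) for zero-shot classification. It assumes the Hugging Face transformers CLIP implementation and the "openai/clip-vit-base-patch32" checkpoint; the placeholder image and label set are illustrative only. Soft-prompt methods such as CoOp follow the same pattern but replace the hand-written template with learnable context vectors trained on a few labeled examples, leaving the frozen VLM weights untouched.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Frozen VLM: neither the image nor the text encoder is updated.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hard prompt: a hand-written task instruction wrapped around each class name.
labels = ["cat", "dog", "car"]
prompts = [f"a photo of a {label}" for label in labels]

image = Image.new("RGB", (224, 224))  # placeholder image for illustration
inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

# Image-text matching scores turned into per-class probabilities.
probs = outputs.logits_per_image.softmax(dim=-1)
for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.3f}")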
Quick Start & Requirements
This repository is a curated list of papers and does not have direct installation or execution requirements. It serves as a reference guide.
Maintenance & Community
The repository is maintained by Jindong Gu and Shuo Chen, with contact information provided for contributions, corrections, and suggestions. The primary reference is the survey paper "A Systematic Survey of Prompt Engineering on Vision-Language Foundation Models."
Licensing & Compatibility
The repository itself does not specify a license. The licensing of the individual papers and their associated code would need to be checked on a per-paper basis.
Limitations & Caveats
The repository is a static list of papers and does not provide executable code or models. The rapidly evolving nature of the field means new research may not be immediately incorporated.