Awesome-Prompting-on-Vision-Language-Model by JindongGu

Survey paper for vision-language model prompt engineering

Created 2 years ago
483 stars

Top 63.5% on SourcePulse

View on GitHub
Project Summary

This repository serves as a curated collection of research papers on prompt engineering for Vision-Language Models (VLMs). It targets researchers and practitioners in AI and computer vision, offering a structured overview of techniques for adapting VLMs to various tasks like multimodal-to-text generation, image-text matching, and text-to-image synthesis. The primary benefit is a centralized, categorized resource for understanding the evolving landscape of VLM prompting.

How It Works

The repository categorizes prompting methods into "hard prompts" (task instructions, in-context learning, retrieval-based, chain-of-thought) and "soft prompts" (prompt tuning, prefix tuning), focusing on techniques that do not alter the base VLM architecture. It covers three main VLM types: multimodal-to-text generation (e.g., Flamingo), image-text matching (e.g., CLIP), and text-to-image generation (e.g., Stable Diffusion). The papers are organized by VLM type and prompting category, providing titles, venues, years, and code availability.
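
As a concrete illustration of the hard/soft distinction (the repository itself only lists papers and ships no code), the sketch below contrasts a hand-written hard prompt template with CoOp-style soft prompt tuning, in which a few learnable context vectors are prepended to class-name embeddings while the backbone stays frozen. The `FrozenTextEncoder`, embedding shapes, and class names are illustrative placeholders, not any particular paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# "Hard" prompts are hand-written (or generated) text, e.g. zero-shot CLIP templates:
hard_prompts = [f"a photo of a {name}." for name in ("cat", "dog", "car")]

class FrozenTextEncoder(nn.Module):
    """Placeholder for a frozen VLM text encoder mapping token embeddings to the joint space."""
    def __init__(self, dim=512):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, token_embeds):                 # [n_cls, n_tok, dim]
        return self.proj(token_embeds.mean(dim=1))   # [n_cls, dim]

class SoftPromptClassifier(nn.Module):
    """CoOp-style prompt tuning: only the n_ctx context vectors are trainable;
    the backbone stays frozen, so the base VLM architecture is untouched."""
    def __init__(self, text_encoder, class_embeds, n_ctx=16, dim=512):
        super().__init__()
        self.text_encoder = text_encoder
        for p in self.text_encoder.parameters():
            p.requires_grad_(False)
        self.ctx = nn.Parameter(0.02 * torch.randn(n_ctx, dim))   # the soft prompt
        self.register_buffer("class_embeds", class_embeds)        # [n_cls, n_tok, dim]

    def forward(self, image_features):               # [batch, dim] from a frozen image encoder
        n_cls = self.class_embeds.shape[0]
        ctx = self.ctx.unsqueeze(0).expand(n_cls, -1, -1)
        prompts = torch.cat([ctx, self.class_embeds], dim=1)      # prepend learned context
        text_features = F.normalize(self.text_encoder(prompts), dim=-1)
        image_features = F.normalize(image_features, dim=-1)
        return 100.0 * image_features @ text_features.t()         # class logits

# Usage with dummy tensors: 3 classes, 8 name tokens each, batch of 4 images.
model = SoftPromptClassifier(FrozenTextEncoder(), torch.randn(3, 8, 512))
logits = model(torch.randn(4, 512))                 # -> shape [4, 3]
```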

Quick Start & Requirements

This repository is a curated list of papers; there is nothing to install or run. It serves as a reference guide.

Highlighted Details

  • Comprehensive coverage of prompting techniques across three major VLM categories.
  • Detailed categorization of prompting methods (hard vs. soft, sub-categories).
  • Links to code repositories are provided where available for many listed papers.
  • Includes papers on applications, responsible AI, adversarial attacks, and bias in VLMs.

Maintenance & Community

The repository is maintained by Jindong Gu and Shuo Chen, with contact information provided for contributions, corrections, and suggestions. The primary reference is the survey paper "A Systematic Survey of Prompt Engineering on Vision-Language Foundation Models."

Licensing & Compatibility

The repository itself does not specify a license. The licensing of the individual papers and their associated code would need to be checked on a per-paper basis.

Limitations & Caveats

The repository is a static list of papers and does not provide executable code or models. The rapidly evolving nature of the field means new research may not be immediately incorporated.

Health Check

  • Last commit: 6 months ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 8 stars in the last 30 days

Explore Similar Projects

Starred by Stas Bekman (Author of "Machine Learning Engineering Open Book"; Research Engineer at Snowflake), Douwe Kiela (Cofounder of Contextual AI), and 1 more.

lens by ContextualAI
  • Vision-language research paper using LLMs
  • 353 stars · 0.3%
  • Created 2 years ago · Updated 1 month ago

Starred by Jiayi Pan (Author of SWE-Gym; MTS at xAI), Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), and 1 more.

METER by zdou0830
  • Multimodal framework for vision-and-language transformer research
  • 373 stars · 0%
  • Created 3 years ago · Updated 2 years ago

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems") and Elvis Saravia (Founder of DAIR.AI).

DeepSeek-VL2 by deepseek-ai
  • MoE vision-language model for multimodal understanding
  • 5k stars · 0.1%
  • Created 9 months ago · Updated 6 months ago