Awesome-Prompting-on-Vision-Language-Model by JindongGu

Survey paper for vision-language model prompt engineering

Created 2 years ago
483 stars

Top 63.5% on SourcePulse

View on GitHub
Project Summary

This repository serves as a curated collection of research papers on prompt engineering for Vision-Language Models (VLMs). It targets researchers and practitioners in AI and computer vision, offering a structured overview of techniques for adapting VLMs to various tasks like multimodal-to-text generation, image-text matching, and text-to-image synthesis. The primary benefit is a centralized, categorized resource for understanding the evolving landscape of VLM prompting.

How It Works

The repository categorizes prompting methods into "hard prompts" (task instructions, in-context learning, retrieval-based, chain-of-thought) and "soft prompts" (prompt tuning, prefix tuning), focusing on techniques that do not alter the base VLM architecture. It covers three main VLM types: multimodal-to-text generation (e.g., Flamingo), image-text matching (e.g., CLIP), and text-to-image generation (e.g., Stable Diffusion). The papers are organized by VLM type and prompting category, providing titles, venues, years, and code availability.
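
As a concrete illustration of the hard/soft distinction (the repository itself only lists papers and ships no code), the sketch below contrasts a hand-written hard prompt template with CoOp-style soft prompt tuning, in which a few learnable context vectors are prepended to class-name embeddings while the backbone stays frozen. The `FrozenTextEncoder`, embedding shapes, and class names are illustrative placeholders, not any particular paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# "Hard" prompts are hand-written (or generated) text, e.g. zero-shot CLIP templates:
hard_prompts = [f"a photo of a {name}." for name in ("cat", "dog", "car")]

class FrozenTextEncoder(nn.Module):
    """Placeholder for a frozen VLM text encoder mapping token embeddings to the joint space."""
    def __init__(self, dim=512):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, token_embeds):                 # [n_cls, n_tok, dim]
        return self.proj(token_embeds.mean(dim=1))   # [n_cls, dim]

class SoftPromptClassifier(nn.Module):
    """CoOp-style prompt tuning: only the n_ctx context vectors are trainable;
    the backbone stays frozen, so the base VLM architecture is untouched."""
    def __init__(self, text_encoder, class_embeds, n_ctx=16, dim=512):
        super().__init__()
        self.text_encoder = text_encoder
        for p in self.text_encoder.parameters():
            p.requires_grad_(False)
        self.ctx = nn.Parameter(0.02 * torch.randn(n_ctx, dim))   # the soft prompt
        self.register_buffer("class_embeds", class_embeds)        # [n_cls, n_tok, dim]

    def forward(self, image_features):               # [batch, dim] from a frozen image encoder
        n_cls = self.class_embeds.shape[0]
        ctx = self.ctx.unsqueeze(0).expand(n_cls, -1, -1)
        prompts = torch.cat([ctx, self.class_embeds], dim=1)      # prepend learned context
        text_features = F.normalize(self.text_encoder(prompts), dim=-1)
        image_features = F.normalize(image_features, dim=-1)
        return 100.0 * image_features @ text_features.t()         # class logits

# Usage with dummy tensors: 3 classes, 8 name tokens each, batch of 4 images.
model = SoftPromptClassifier(FrozenTextEncoder(), torch.randn(3, 8, 512))
logits = model(torch.randn(4, 512))                 # -> shape [4, 3]
```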

Quick Start & Requirements

This repository is a curated list of papers; there is nothing to install or run. It serves as a reference guide.

Highlighted Details

  • Comprehensive coverage of prompting techniques across three major VLM categories.
  • Detailed categorization of prompting methods (hard vs. soft, sub-categories).
  • Links to code repositories are provided where available for many listed papers.
  • Includes papers on applications, responsible AI, adversarial attacks, and bias in VLMs.

Maintenance & Community

The repository is maintained by Jindong Gu and Shuo Chen, with contact information provided for contributions, corrections, and suggestions. The primary reference is the survey paper "A Systematic Survey of Prompt Engineering on Vision-Language Foundation Models."

Licensing & Compatibility

The repository itself does not specify a license. The licensing of the individual papers and their associated code would need to be checked on a per-paper basis.

Limitations & Caveats

The repository is a static list of papers and does not provide executable code or models. The rapidly evolving nature of the field means new research may not be immediately incorporated.

Health Check

  • Last commit: 6 months ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 8 stars in the last 30 days

Explore Similar Projects

Starred by Stas Bekman (Author of "Machine Learning Engineering Open Book"; Research Engineer at Snowflake), Douwe Kiela (Cofounder of Contextual AI), and 1 more.

lens by ContextualAI
  • Vision-language research paper using LLMs
  • 353 stars · 0.3%
  • Created 2 years ago · Updated 1 month ago

Starred by Jiayi Pan (Author of SWE-Gym; MTS at xAI), Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), and 1 more.

METER by zdou0830
  • Multimodal framework for vision-and-language transformer research
  • 373 stars · 0%
  • Created 3 years ago · Updated 2 years ago

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems") and Elvis Saravia (Founder of DAIR.AI).

DeepSeek-VL2 by deepseek-ai
  • MoE vision-language model for multimodal understanding
  • 5k stars · 0.1%
  • Created 9 months ago · Updated 6 months ago