Research paper on multimodal prompt learning for vision-language models
This repository provides the official implementation of MaPLe (Multi-modal Prompt Learning), a CVPR 2023 paper. It addresses the sub-optimality of adapting only one modality (vision or language) in CLIP-like models by learning prompts for both branches simultaneously, fostering synergy between them and improving generalization to novel classes and unseen domain shifts. It is aimed at researchers and practitioners who work with vision-language models and need stronger performance and adaptability.
How It Works
MaPLe learns prompts for both the vision and language branches of CLIP, explicitly conditioning vision prompts on their language counterparts. This coupling allows for mutual propagation of gradients, promoting synergy between modalities. Furthermore, it employs deep prompting, learning multi-modal prompts across multiple transformer blocks in both branches to progressively capture synergistic behavior and rich context. This approach aims to overcome the limitations of uni-modal prompting by enabling dynamic adjustment of both representation spaces.
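The coupling idea can be illustrated with a short PyTorch sketch. The module below is a minimal illustration, not the repository's code: the class name, prompt length, prompting depth, and embedding dimensions (512 for CLIP's text encoder, 768 for a ViT-B/16 image encoder) are assumptions chosen for the example. Language prompts are learnable parameters, and each vision prompt is produced by a per-layer linear coupling function applied to its language counterpart, so gradients from both branches update the same shared parameters.

```python
# Minimal sketch of MaPLe-style multi-modal prompt coupling.
# Names, prompt length, depth, and dimensions are assumptions for illustration.
import torch
import torch.nn as nn


class MultiModalPrompts(nn.Module):
    """Learnable language prompts plus vision prompts derived from them."""

    def __init__(self, n_prompts=2, depth=9, txt_dim=512, vis_dim=768):
        super().__init__()
        # One set of learnable language prompt tokens per prompted transformer layer.
        self.text_prompts = nn.ParameterList(
            [nn.Parameter(torch.randn(n_prompts, txt_dim) * 0.02) for _ in range(depth)]
        )
        # Coupling functions: project each language prompt into the vision
        # embedding space, so both branches share (and update) the same parameters.
        self.couplers = nn.ModuleList(
            [nn.Linear(txt_dim, vis_dim) for _ in range(depth)]
        )

    def forward(self):
        # Per-layer (language_prompt, vision_prompt) pairs for deep prompting.
        return [(p, f(p)) for p, f in zip(self.text_prompts, self.couplers)]


prompts = MultiModalPrompts()
txt_p, vis_p = prompts()[0]
print(txt_p.shape, vis_p.shape)  # torch.Size([2, 512]) torch.Size([2, 768])
```

In the method itself, these per-layer prompt pairs are injected into the first several transformer blocks of both encoders (deep prompting) rather than only at the input layer.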
Quick Start & Requirements
Setup and usage are documented in the repository: INSTALL.md (installation and environment setup), DATASETS.md (dataset preparation), and RUN.md (training and evaluation).
Limitations & Caveats
The README does not detail specific limitations, unsupported platforms, or known bugs. The project is presented as an official implementation of a published research paper.