multimodal-prompt-learning by muzairkhattak

Research paper on multimodal prompt learning for vision-language models

created 2 years ago · 760 stars · Top 46.7% on sourcepulse

Project Summary

This repository provides the official implementation of MaPLe (Multi-modal Prompt Learning), a CVPR 2023 paper. Adapting only one modality of CLIP-like models (vision or language) is sub-optimal; MaPLe instead learns prompts for both branches simultaneously, fostering synergy between them and improving generalization to novel classes and unseen domain shifts. It targets researchers and practitioners working with vision-language models who need stronger performance and adaptability.

How It Works

MaPLe learns prompts for both the vision and language branches of CLIP, explicitly conditioning vision prompts on their language counterparts. This coupling allows for mutual propagation of gradients, promoting synergy between modalities. Furthermore, it employs deep prompting, learning multi-modal prompts across multiple transformer blocks in both branches to progressively capture synergistic behavior and rich context. This approach aims to overcome the limitations of uni-modal prompting by enabling dynamic adjustment of both representation spaces.
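
To make the coupling concrete, here is a minimal PyTorch sketch, not the official implementation: language prompt tokens are learned directly, and each layer's vision prompts are produced from them by a learned linear projection, so gradients flow through both branches. The class name is illustrative; the default sizes (2 context tokens, prompt depth 9, widths 512/768) follow CLIP ViT-B/16 and the paper's reported settings.

    # Sketch of MaPLe-style coupled deep prompts (illustrative, not the
    # official code). Language prompts are learned directly; vision
    # prompts are derived from them via per-layer linear "coupling
    # functions", so both branches update together.
    import torch
    import torch.nn as nn

    class CoupledPrompts(nn.Module):
        def __init__(self, n_ctx=2, depth=9, txt_dim=512, vis_dim=768):
            super().__init__()
            # One set of learnable language prompt tokens per prompted layer.
            self.text_prompts = nn.ParameterList(
                [nn.Parameter(0.02 * torch.randn(n_ctx, txt_dim))
                 for _ in range(depth)]
            )
            # Per-layer projections mapping language prompts into the
            # vision branch's token space (the coupling functions).
            self.couplers = nn.ModuleList(
                [nn.Linear(txt_dim, vis_dim) for _ in range(depth)]
            )

        def forward(self):
            # Per-layer (language, vision) prompt pairs; these would be
            # prepended to the token sequences of the corresponding text
            # and image transformer blocks.
            return [(p, f(p)) for p, f in zip(self.text_prompts, self.couplers)]

    prompts = CoupledPrompts()
    lang0, vis0 = prompts()[0]
    print(lang0.shape, vis0.shape)  # torch.Size([2, 512]) torch.Size([2, 768])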

Quick Start & Requirements

  • Installation instructions are detailed in INSTALL.md.
  • Data preparation instructions are in DATASETS.md.
  • Training and evaluation instructions are in RUN.md.
  • Requires dataset-specific preparation and large pre-trained CLIP backbone weights (see the sanity check after this list).
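
As a quick environment sanity check, the following minimal sketch assumes the OpenAI clip package (a dependency of CoOp-style repos) and the ViT-B/16 backbone used in the paper; the weights (several hundred MB) download automatically on first use:

    # Verify the CLIP backbone loads. Assumes the OpenAI `clip`
    # package is installed; ViT-B/16 weights are downloaded on
    # first use.
    import clip
    import torch

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/16", device=device)
    print(model.visual.input_resolution)  # 224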

Highlighted Details

  • Achieves an absolute gain of 3.45% on novel classes and 2.72% on the overall harmonic mean compared to Co-CoOp, averaged over 11 diverse image recognition datasets (the harmonic-mean metric is illustrated after this list).
  • Supports MaPLe, CoOp, Co-CoOp, Deep Vision Prompting, Deep Language Prompting, and Independent V-L Prompting architectures.
  • Pretrained models and training/evaluation code are released.
  • Code is based on the Co-CoOp and CoOp repositories.
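
The harmonic mean above is the standard base-to-novel generalization metric from the CoOp/Co-CoOp line of work: it balances base-class and novel-class accuracy, penalizing methods that trade one for the other. The accuracies below are made-up placeholders, not results from the paper:

    # Harmonic mean (HM) of base- and novel-class accuracy, the
    # summary metric used in base-to-novel benchmarks. The inputs
    # here are placeholders.
    def harmonic_mean(base_acc: float, novel_acc: float) -> float:
        return 2 * base_acc * novel_acc / (base_acc + novel_acc)

    print(round(harmonic_mean(80.0, 70.0), 2))  # 74.67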

Maintenance & Community

  • The project is associated with CVPR 2023 and ICCV 2023 papers.
  • Contact information for questions is provided via email and GitHub issues.

Licensing & Compatibility

  • The README does not explicitly state a license.
  • Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The README does not detail specific limitations, unsupported platforms, or known bugs. The project is presented as an official implementation of a published research paper.

Health Check

  • Last commit: 2 years ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 22 stars in the last 90 days
