Post-training method for improving CLIP models
DIVA is a post-training method designed to enhance the visual understanding capabilities of CLIP models by leveraging generative feedback from diffusion models. It targets researchers and developers working with multimodal large language models (MLLMs) and vision models, aiming to improve fine-grained visual recognition and multimodal understanding tasks without requiring paired image-text data.
How It Works
DIVA utilizes a diffusion model as a "visual assistant" to optimize CLIP's image representations. The CLIP model encodes image features, which then condition a diffusion model. This diffusion model predicts noise added to a noisy image, and the CLIP representation is optimized by maximizing image likelihood through this diffusion loss. This approach allows for self-supervised refinement of CLIP's visual perception using generative feedback.
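The loop described above can be illustrated with a toy, pure-Python sketch. Everything in it is an assumption made for illustration, not DIVA's actual code: the linear map `W` stands in for CLIP's image encoder, the fixed matrices `A` and `B` stand in for a diffusion denoiser that is kept frozen here (this summary does not say whether DIVA freezes it), and a single noise level `alpha_bar` replaces a full noise schedule. The sketch only shows the direction of the feedback: the noise-prediction loss is computed by the denoiser, but the gradient step is applied to the encoder that produced the condition.

```python
import math
import random


def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def diffusion_loss_and_grad(W, A, B, x0, alpha_bar, rng):
    """Noise-prediction loss for one image and its gradient w.r.t. W only."""
    d = len(x0)
    eps = [rng.gauss(0.0, 1.0) for _ in range(d)]  # noise to be predicted
    s0, s1 = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    x_t = [s0 * x + s1 * e for x, e in zip(x0, eps)]  # noisy image
    c = matvec(W, x0)  # encoder output, used as the condition
    # Toy frozen denoiser: eps_hat = A @ x_t + B @ c
    eps_hat = [a + b for a, b in zip(matvec(A, x_t), matvec(B, c))]
    resid = [p - e for p, e in zip(eps_hat, eps)]
    loss = sum(r * r for r in resid) / d
    # Manual backprop: dL/dc = B^T (2/d) resid, dL/dW = outer(dL/dc, x0)
    g_c = [sum(B[i][j] * 2.0 * resid[i] / d for i in range(d))
           for j in range(len(c))]
    grad_W = [[g * x for x in x0] for g in g_c]
    return loss, grad_W


def refine_encoder(W, A, B, x0, alpha_bar=0.5, lr=0.5, steps=20, seed=0):
    """Gradient descent on the encoder weights; A and B are never updated."""
    rng = random.Random(seed)
    losses = []
    for _ in range(steps):
        loss, grad_W = diffusion_loss_and_grad(W, A, B, x0, alpha_bar, rng)
        losses.append(loss)
        W = [[w - lr * g for w, g in zip(w_row, g_row)]
             for w_row, g_row in zip(W, grad_W)]
    return W, losses


d = 4
alpha_bar = 0.5
identity = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]
# Hypothetical frozen denoiser that needs image detail from the condition
A = [[v / math.sqrt(1.0 - alpha_bar) for v in row] for row in identity]
B = identity
W0 = [[0.0] * d for _ in range(d)]  # encoder that starts out uninformative
x0 = [0.5, -1.0, 0.25, 0.8]         # toy "image"

W, losses = refine_encoder(W0, A, B, x0, alpha_bar=alpha_bar)
print(f"loss before: {losses[0]:.6f}, loss after: {losses[-1]:.2e}")
```

In this toy setup the denoiser can only predict the noise well when the condition carries enough information about the image, so lowering the diffusion loss pushes detail into the encoder's representation. No text labels appear anywhere in the loop, which is the sense in which the refinement is self-supervised.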
Quick Start & Requirements
Install: create a conda environment (conda create -n diva python=3.9), activate it (conda activate diva), and install requirements (pip install -r requirements.txt).
Data: training data is obtained with image2dataset and placed in the datasets/ folder.
Setup: replace the installed library files (clip/model.py, open_clip/transformer.py, timm/models/vision_transformer.py) with the provided modified versions.
Training: run the provided script (DIVA_for_OpenAICLIP.sh).
Highlighted Details
Maintenance & Community
The project is associated with BAAI, and its authors are affiliated with CASIA, UCAS, and BJTU. Code and weights were released in August 2024, and the paper was accepted to ICLR 2025.
Licensing & Compatibility
The repository does not explicitly state a license. The underlying libraries (PyTorch, OpenCLIP, timm) use permissive licenses (BSD-style, MIT-style, and Apache 2.0, respectively). Whether DIVA itself may be used commercially would require clarification.
Limitations & Caveats
The README notes that results on the MMVP_VLM benchmark using provided OpenAI CLIP weights may vary due to randomness in condition design and patch token selection during training and inference. Users are advised to try different random seeds if results do not meet expectations.
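A generic way to try several seeds is sketched below. It is not taken from the DIVA repository: `set_seed` and `sweep` are hypothetical helper names, `run_once` is a placeholder for whatever callable wraps a training or evaluation run, and this summary does not say whether the DIVA scripts expose a seed option. NumPy and PyTorch are seeded only if they are installed.

```python
import random


def set_seed(seed: int) -> None:
    """Seed every random number generator available in this environment."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed; nothing to seed
    try:
        import torch
        torch.manual_seed(seed)
    except ImportError:
        pass  # PyTorch not installed; nothing to seed


def sweep(seeds, run_once):
    """Seed, then call run_once(seed) for each seed; return {seed: result}."""
    results = {}
    for seed in seeds:
        set_seed(seed)
        results[seed] = run_once(seed)
    return results
```

Calling `sweep([0, 1, 2], run_once)` returns one result per seed, which makes the run-to-run spread visible instead of relying on a single number.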