DIVA by baaivision

Post-training method for improving CLIP models

created 1 year ago
283 stars

Top 93.3% on sourcepulse

Project Summary

DIVA is a post-training method designed to enhance the visual understanding capabilities of CLIP models by leveraging generative feedback from diffusion models. It targets researchers and developers working with multimodal large language models (MLLMs) and vision models, aiming to improve fine-grained visual recognition and multimodal understanding tasks without requiring paired image-text data.

How It Works

DIVA uses a diffusion model as a "visual assistant" to optimize CLIP's image representations. CLIP encodes an image into features that condition a frozen diffusion model; the diffusion model predicts the noise added to a noised version of that image, and minimizing this noise-prediction (diffusion) loss maximizes the image likelihood, back-propagating generative feedback into CLIP's visual encoder. The result is self-supervised refinement of CLIP's visual perception without paired image-text data. A minimal sketch of this loop follows.
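
The sketch below is a self-contained illustration, not the repository's code: ToyCLIPVisual, ToyDenoiser, the simplified forward-diffusion step, and all tensor shapes are stand-ins. DIVA itself conditions a frozen SD-2-1-base model on CLIP's visual tokens and updates only the CLIP encoder.

```python
# Minimal sketch of DIVA-style generative feedback (all modules are toy stand-ins).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyCLIPVisual(nn.Module):          # stand-in for CLIP's image encoder (trainable)
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, dim))
    def forward(self, x):
        return self.net(x)               # (B, dim) image condition

class ToyDenoiser(nn.Module):            # stand-in for the frozen diffusion model
    def __init__(self, dim=256):
        super().__init__()
        self.cond_proj = nn.Linear(dim, 3 * 32 * 32)
        self.net = nn.Conv2d(6, 3, kernel_size=3, padding=1)
    def forward(self, x_noisy, cond):
        c = self.cond_proj(cond).view(-1, 3, 32, 32)
        return self.net(torch.cat([x_noisy, c], dim=1))   # predicted noise

clip_visual, denoiser = ToyCLIPVisual(), ToyDenoiser()
for p in denoiser.parameters():
    p.requires_grad_(False)              # the generative model stays frozen
opt = torch.optim.AdamW(clip_visual.parameters(), lr=1e-5)

images = torch.randn(8, 3, 32, 32)       # a real run would use img2dataset output
noise = torch.randn_like(images)
t = torch.rand(8, 1, 1, 1)               # noise level in [0, 1]
noisy = (1 - t) * images + t * noise     # simplified forward diffusion

opt.zero_grad(set_to_none=True)
cond = clip_visual(images)               # CLIP features condition the denoiser
loss = F.mse_loss(denoiser(noisy, cond), noise)   # diffusion (noise-prediction) loss
loss.backward()                          # gradients flow only into the CLIP encoder
opt.step()
```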

Quick Start & Requirements

  • Installation: Clone the repository, create a conda environment (conda create -n diva python=3.9), activate it (conda activate diva), and install requirements (pip install -r requirements.txt).
  • Prerequisites: PyTorch 2.0.0, open-clip-torch 2.24.0, timm 0.9.8. Requires downloading pre-trained weights for various CLIP models (OpenAI, MetaCLIP, SigLIP, DFN) and a diffusion model (SD-2-1-base).
  • Data: Datasets need to be prepared with img2dataset and placed in the datasets/ folder.
  • Code Modification: Requires replacing specific model implementation files inside the installed Python packages (e.g., clip/model.py, open_clip/transformer.py, timm/models/vision_transformer.py) with the provided modified versions.
  • Training/Evaluation: Execute the provided bash scripts (e.g., DIVA_for_OpenAICLIP.sh); a zero-shot sanity-check sketch follows this list.
  • Links: Official Paper
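
Since DIVA is meant to preserve CLIP's zero-shot behavior, a quick zero-shot check with open-clip-torch is a reasonable sanity test after running the scripts. This is a hedged sketch, not part of the repository: the model name, image path, and prompts are placeholders, and a DIVA-tuned checkpoint would be loaded in place of the stock OpenAI weights.

```python
import torch
import open_clip
from PIL import Image

# Load a CLIP model; swap in DIVA-tuned weights via the pretrained/checkpoint path.
model, _, preprocess = open_clip.create_model_and_transforms("ViT-L-14", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-L-14")
model.eval()

image = preprocess(Image.open("example.jpg")).unsqueeze(0)   # placeholder image
text = tokenizer(["a photo of a cat", "a photo of a dog"])   # placeholder prompts

with torch.no_grad():
    img_feat = model.encode_image(image)
    txt_feat = model.encode_text(text)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    probs = (100.0 * img_feat @ txt_feat.T).softmax(dim=-1)

print(probs)   # zero-shot class probabilities should remain sensible after post-training
```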

Highlighted Details

  • Improves CLIP's fine-grained visual abilities by 3-7% on the MMVP-VLM benchmark.
  • Enhances performance in multimodal understanding and segmentation tasks.
  • Preserves CLIP's zero-shot capabilities across 29 image classification and retrieval benchmarks.
  • Supports multiple CLIP variants including OpenAI, MetaCLIP, SigLIP, and DFN.

Maintenance & Community

The project is associated with BAAI, with author affiliations at CASIA, UCAS, and BJTU. It is recent: code and weights were released in August 2024, and the paper was accepted at ICLR 2025.

Licensing & Compatibility

The repository does not explicitly state a license. The underlying libraries (PyTorch, OpenCLIP, timm) carry permissive licenses, but the terms for commercial use of DIVA itself would require clarification from the authors.

Limitations & Caveats

The README notes that results on the MMVP-VLM benchmark obtained with the provided OpenAI CLIP weights may vary because of randomness in the condition design and in patch-token selection during training and inference. Users are advised to try different random seeds if results fall short of expectations; a minimal seeding sketch is shown below.
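
One way to rerun an evaluation under several seeds is sketched here; set_seed and the loop are assumptions for illustration, not a utility shipped with the repository.

```python
import random
import numpy as np
import torch

def set_seed(seed: int) -> None:
    # Seed the common sources of randomness (Python, NumPy, PyTorch CPU/GPU).
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

for seed in (0, 42, 1234):        # retry the evaluation under a few seeds
    set_seed(seed)
    # ... invoke the MMVP-VLM evaluation here ...
```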

Health Check

  • Last commit: 6 months ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

6 stars in the last 90 days
