DIVA by baaivision

Post-training method for improving CLIP models

created 1 year ago
283 stars

Top 93.3% on sourcepulse

Project Summary

DIVA is a post-training method designed to enhance the visual understanding capabilities of CLIP models by leveraging generative feedback from diffusion models. It targets researchers and developers working with multimodal large language models (MLLMs) and vision models, aiming to improve fine-grained visual recognition and multimodal understanding tasks without requiring paired image-text data.

How It Works

DIVA uses a diffusion model as a "visual assistant" to optimize CLIP's image representations. CLIP encodes an image into features that condition a frozen diffusion model; the diffusion model predicts the noise added to a noised version of that image, and minimizing this noise-prediction (diffusion) loss maximizes the image likelihood, back-propagating generative feedback into CLIP's visual encoder. The result is self-supervised refinement of CLIP's visual perception without paired image-text data. A minimal sketch of this loop follows.
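
The sketch below is a self-contained illustration, not the repository's code: ToyCLIPVisual, ToyDenoiser, the simplified forward-diffusion step, and all tensor shapes are stand-ins. DIVA itself conditions a frozen SD-2-1-base model on CLIP's visual tokens and updates only the CLIP encoder.

```python
# Minimal sketch of DIVA-style generative feedback (all modules are toy stand-ins).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyCLIPVisual(nn.Module):          # stand-in for CLIP's image encoder (trainable)
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, dim))
    def forward(self, x):
        return self.net(x)               # (B, dim) image condition

class ToyDenoiser(nn.Module):            # stand-in for the frozen diffusion model
    def __init__(self, dim=256):
        super().__init__()
        self.cond_proj = nn.Linear(dim, 3 * 32 * 32)
        self.net = nn.Conv2d(6, 3, kernel_size=3, padding=1)
    def forward(self, x_noisy, cond):
        c = self.cond_proj(cond).view(-1, 3, 32, 32)
        return self.net(torch.cat([x_noisy, c], dim=1))   # predicted noise

clip_visual, denoiser = ToyCLIPVisual(), ToyDenoiser()
for p in denoiser.parameters():
    p.requires_grad_(False)              # the generative model stays frozen
opt = torch.optim.AdamW(clip_visual.parameters(), lr=1e-5)

images = torch.randn(8, 3, 32, 32)       # a real run would use img2dataset output
noise = torch.randn_like(images)
t = torch.rand(8, 1, 1, 1)               # noise level in [0, 1]
noisy = (1 - t) * images + t * noise     # simplified forward diffusion

opt.zero_grad(set_to_none=True)
cond = clip_visual(images)               # CLIP features condition the denoiser
loss = F.mse_loss(denoiser(noisy, cond), noise)   # diffusion (noise-prediction) loss
loss.backward()                          # gradients flow only into the CLIP encoder
opt.step()
```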

Quick Start & Requirements

  • Installation: Clone the repository, create a conda environment (conda create -n diva python=3.9), activate it (conda activate diva), and install requirements (pip install -r requirements.txt).
  • Prerequisites: PyTorch 2.0.0, open-clip-torch 2.24.0, timm 0.9.8. Requires downloading pre-trained weights for various CLIP models (OpenAI, MetaCLIP, SigLIP, DFN) and a diffusion model (SD-2-1-base).
  • Data: Datasets need to be prepared with img2dataset and placed in the datasets/ folder.
  • Code Modification: Requires replacing specific model implementation files inside the installed Python packages (e.g., clip/model.py, open_clip/transformer.py, timm/models/vision_transformer.py) with the provided modified versions.
  • Training/Evaluation: Execute the provided bash scripts (e.g., DIVA_for_OpenAICLIP.sh); a zero-shot sanity-check sketch follows this list.
  • Links: Official Paper
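
Since DIVA is meant to preserve CLIP's zero-shot behavior, a quick zero-shot check with open-clip-torch is a reasonable sanity test after running the scripts. This is a hedged sketch, not part of the repository: the model name, image path, and prompts are placeholders, and a DIVA-tuned checkpoint would be loaded in place of the stock OpenAI weights.

```python
import torch
import open_clip
from PIL import Image

# Load a CLIP model; swap in DIVA-tuned weights via the pretrained/checkpoint path.
model, _, preprocess = open_clip.create_model_and_transforms("ViT-L-14", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-L-14")
model.eval()

image = preprocess(Image.open("example.jpg")).unsqueeze(0)   # placeholder image
text = tokenizer(["a photo of a cat", "a photo of a dog"])   # placeholder prompts

with torch.no_grad():
    img_feat = model.encode_image(image)
    txt_feat = model.encode_text(text)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    probs = (100.0 * img_feat @ txt_feat.T).softmax(dim=-1)

print(probs)   # zero-shot class probabilities should remain sensible after post-training
```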

Highlighted Details

  • Improves CLIP's fine-grained visual abilities by 3-7% on the MMVP-VLM benchmark.
  • Enhances performance in multimodal understanding and segmentation tasks.
  • Preserves CLIP's zero-shot capabilities across 29 image classification and retrieval benchmarks.
  • Supports multiple CLIP variants including OpenAI, MetaCLIP, SigLIP, and DFN.

Maintenance & Community

The project is associated with BAAI, with author affiliations at CASIA, UCAS, and BJTU. It is recent: code and weights were released in August 2024, and the paper was accepted at ICLR 2025.

Licensing & Compatibility

The repository does not explicitly state a license. The underlying libraries (PyTorch, OpenCLIP, timm) carry permissive licenses, but the terms for commercial use of DIVA itself would require clarification from the authors.

Limitations & Caveats

The README notes that results on the MMVP-VLM benchmark obtained with the provided OpenAI CLIP weights may vary because of randomness in the condition design and in patch-token selection during training and inference. Users are advised to try different random seeds if results fall short of expectations; a minimal seeding sketch is shown below.
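
One way to rerun an evaluation under several seeds is sketched here; set_seed and the loop are assumptions for illustration, not a utility shipped with the repository.

```python
import random
import numpy as np
import torch

def set_seed(seed: int) -> None:
    # Seed the common sources of randomness (Python, NumPy, PyTorch CPU/GPU).
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

for seed in (0, 42, 1234):        # retry the evaluation under a few seeds
    set_seed(seed)
    # ... invoke the MMVP-VLM evaluation here ...
```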

Health Check

  • Last commit: 6 months ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

6 stars in the last 90 days
