ViFi-CLIP by muzairkhattak

Research paper exploring fine-tuned CLIP models for video learning

created 2 years ago
283 stars

Top 93.3% on sourcepulse

Project Summary

This repository provides the official implementation of "Fine-tuned CLIP models are efficient video learners" (CVPR 2023). It shows that simply fine-tuning a pre-trained CLIP model (ViFi-CLIP) is enough to adapt it to the video domain, achieving competitive performance against more complex methods without explicit temporal modeling. The project targets researchers and practitioners in video understanding and multi-modal learning.

How It Works

ViFi-CLIP adapts image-pretrained CLIP models to video tasks by fine-tuning them directly on video data. The core idea is that frame-level processing with CLIP's image encoder, followed by temporal pooling of the frame features and similarity matching against text embeddings, implicitly captures temporal cues. This avoids dedicated temporal modules, yielding simpler models that are less prone to overfitting and generalize better. A "bridge and prompt" method is also proposed for low-data regimes, combining fine-tuning with prompt learning.
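For intuition, here is a minimal PyTorch sketch of this frame-encode, pool, and match pipeline using the OpenAI clip package. It is a conceptual illustration, not the repository's code: the model variant, frame count, prompt template, and class names are assumptions, and mean pooling stands in for the actual aggregation step.

```python
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/16", device=device)

# Dummy video: T frames already preprocessed to CLIP's input resolution.
T = 8
frames = torch.randn(T, 3, 224, 224, device=device)  # stand-in for real frames

# Candidate action labels (hypothetical examples).
class_names = ["archery", "playing guitar", "surfing"]
text_tokens = clip.tokenize([f"a photo of a person {c}" for c in class_names]).to(device)

with torch.no_grad():
    frame_feats = model.encode_image(frames)             # (T, D): per-frame embeddings
    video_feat = frame_feats.mean(dim=0, keepdim=True)   # (1, D): temporal pooling
    text_feats = model.encode_text(text_tokens)          # (C, D): class embeddings

    # Cosine similarity between the pooled video feature and each class embedding.
    video_feat = video_feat / video_feat.norm(dim=-1, keepdim=True)
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)
    logits = model.logit_scale.exp() * video_feat @ text_feats.t()  # (1, C)

print(class_names[logits.argmax(dim=-1).item()])
```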

Quick Start & Requirements

  • Install: Follow instructions in INSTALL.md.
  • Data: Prepare datasets as per DATASETS.md.
  • Training: python -m torch.distributed.launch --nproc_per_node=8 main.py -cfg configs/fully_supervised/k400/16_16_vifi_clip.yaml --output /PATH/TO/OUTPUT
  • Evaluation: python -m torch.distributed.launch --nproc_per_node=8 main.py -cfg configs/fully_supervised/k400/16_16_vifi_clip.yaml --output /PATH/TO/OUTPUT --only_test --resume /PATH/TO/CKPT --opts TEST.NUM_CLIP 4 TEST.NUM_CROP 3 (the multi-clip/multi-crop test setting is sketched after this list)
  • Prerequisites: PyTorch, distributed training setup (e.g., 8 GPUs recommended for training).
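The TEST.NUM_CLIP 4 and TEST.NUM_CROP 3 options correspond to standard multi-view evaluation: each video is scored from 4 temporal clips × 3 spatial crops, and the per-view predictions are averaged. A minimal sketch of that aggregation is below; the shapes and the softmax-averaging choice are assumptions, not the repository's exact code.

```python
import torch

num_clips, num_crops, num_classes = 4, 3, 400

# Per-view logits for one video: one row per (clip, crop) combination.
view_logits = torch.randn(num_clips * num_crops, num_classes)

# Average class probabilities over all 12 views, then pick the top class.
probs = view_logits.softmax(dim=-1).mean(dim=0)  # (num_classes,)
pred = probs.argmax().item()
print(f"predicted class index: {pred}")
```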

Highlighted Details

  • Achieves competitive zero-shot, base-to-novel generalization, few-shot, and fully supervised performance on benchmarks like UCF-101, HMDB-51, Kinetics-400, and SSv2.
  • Introduces a base-to-novel generalization benchmark for video action recognition.
  • Proposes a "bridge and prompt" approach for effective adaptation in low-data scenarios (see the prompt-learning sketch after this list).
  • Provides interactive notebooks for running inference with minimal setup.
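To illustrate the prompt-learning half of "bridge and prompt", the toy sketch below prepends learnable context vectors to class-name token embeddings before a stand-in text encoder. The encoder, dimensions, and names are placeholders for illustration only, not ViFi-CLIP's actual implementation.

```python
import torch
import torch.nn as nn

class PromptedTextEncoder(nn.Module):
    """Toy text branch: learnable prompt tokens + a stand-in transformer encoder."""
    def __init__(self, vocab_size=10000, dim=512, n_ctx=8, n_layers=2):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, dim)
        # n_ctx learnable context vectors shared across all classes.
        self.ctx = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, class_token_ids):
        # class_token_ids: (C, L) token ids for each class-name prompt.
        tok = self.token_embedding(class_token_ids)               # (C, L, D)
        ctx = self.ctx.unsqueeze(0).expand(tok.size(0), -1, -1)   # (C, n_ctx, D)
        x = torch.cat([ctx, tok], dim=1)                          # (C, n_ctx + L, D)
        x = self.encoder(x)
        return x.mean(dim=1)                                      # (C, D) class embeddings

# Hypothetical usage: 3 classes, 4 tokens each; the output embeddings would be
# matched against pooled video features as in the earlier sketch.
enc = PromptedTextEncoder()
class_ids = torch.randint(0, 10000, (3, 4))
text_feats = enc(class_ids)
print(text_feats.shape)  # torch.Size([3, 512])
```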

Maintenance & Community

The project is the official implementation of a CVPR 2023 paper. Contact emails for questions are provided. The code is based on the XCLIP repository.

Licensing & Compatibility

The repository does not explicitly state a license. It is based on XCLIP, which is commonly distributed under an MIT license, but the terms of both should be verified before use.

Limitations & Caveats

The README does not explicitly state the license. The training and evaluation commands assume a distributed setup (e.g., 8 GPUs) and require specific data preparation steps.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 13 stars in the last 90 days
