ViFi-CLIP by muzairkhattak

Research paper exploring fine-tuned CLIP models for video learning

created 2 years ago
283 stars

Top 93.3% on sourcepulse

Project Summary

This repository provides the official implementation of "Fine-tuned CLIP models are efficient video learners" (CVPR 2023). It shows that simply fine-tuning a pre-trained CLIP model (ViFi-CLIP) is enough to adapt it to the video domain, achieving competitive performance against more complex methods without explicit temporal modeling. The project targets researchers and practitioners in video understanding and multi-modal learning.

How It Works

ViFi-CLIP adapts image-pretrained CLIP models to video tasks by fine-tuning them directly on video data. The core idea is that frame-level processing with CLIP's image encoder, followed by temporal pooling of the frame features and similarity matching against text embeddings, implicitly captures temporal cues. This avoids dedicated temporal modules, yielding simpler models that are less prone to overfitting and generalize better. A "bridge and prompt" method is also proposed for low-data regimes, combining fine-tuning with prompt learning.
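For intuition, here is a minimal PyTorch sketch of this frame-encode, pool, and match pipeline using the OpenAI clip package. It is a conceptual illustration, not the repository's code: the model variant, frame count, prompt template, and class names are assumptions, and mean pooling stands in for the actual aggregation step.

```python
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/16", device=device)

# Dummy video: T frames already preprocessed to CLIP's input resolution.
T = 8
frames = torch.randn(T, 3, 224, 224, device=device)  # stand-in for real frames

# Candidate action labels (hypothetical examples).
class_names = ["archery", "playing guitar", "surfing"]
text_tokens = clip.tokenize([f"a photo of a person {c}" for c in class_names]).to(device)

with torch.no_grad():
    frame_feats = model.encode_image(frames)             # (T, D): per-frame embeddings
    video_feat = frame_feats.mean(dim=0, keepdim=True)   # (1, D): temporal pooling
    text_feats = model.encode_text(text_tokens)          # (C, D): class embeddings

    # Cosine similarity between the pooled video feature and each class embedding.
    video_feat = video_feat / video_feat.norm(dim=-1, keepdim=True)
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)
    logits = model.logit_scale.exp() * video_feat @ text_feats.t()  # (1, C)

print(class_names[logits.argmax(dim=-1).item()])
```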

Quick Start & Requirements

  • Install: Follow instructions in INSTALL.md.
  • Data: Prepare datasets as per DATASETS.md.
  • Training: python -m torch.distributed.launch --nproc_per_node=8 main.py -cfg configs/fully_supervised/k400/16_16_vifi_clip.yaml --output /PATH/TO/OUTPUT
  • Evaluation: python -m torch.distributed.launch --nproc_per_node=8 main.py -cfg configs/fully_supervised/k400/16_16_vifi_clip.yaml --output /PATH/TO/OUTPUT --only_test --resume /PATH/TO/CKPT --opts TEST.NUM_CLIP 4 TEST.NUM_CROP 3 (the multi-clip/multi-crop test setting is sketched after this list)
  • Prerequisites: PyTorch, distributed training setup (e.g., 8 GPUs recommended for training).
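The TEST.NUM_CLIP 4 and TEST.NUM_CROP 3 options correspond to standard multi-view evaluation: each video is scored from 4 temporal clips × 3 spatial crops, and the per-view predictions are averaged. A minimal sketch of that aggregation is below; the shapes and the softmax-averaging choice are assumptions, not the repository's exact code.

```python
import torch

num_clips, num_crops, num_classes = 4, 3, 400

# Per-view logits for one video: one row per (clip, crop) combination.
view_logits = torch.randn(num_clips * num_crops, num_classes)

# Average class probabilities over all 12 views, then pick the top class.
probs = view_logits.softmax(dim=-1).mean(dim=0)  # (num_classes,)
pred = probs.argmax().item()
print(f"predicted class index: {pred}")
```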

Highlighted Details

  • Achieves competitive zero-shot, base-to-novel generalization, few-shot, and fully supervised performance on benchmarks like UCF-101, HMDB-51, Kinetics-400, and SSv2.
  • Introduces a base-to-novel generalization benchmark for video action recognition.
  • Proposes a "bridge and prompt" approach for effective adaptation in low-data scenarios (see the prompt-learning sketch after this list).
  • Provides interactive notebooks for running inference with minimal setup.
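To illustrate the prompt-learning half of "bridge and prompt", the toy sketch below prepends learnable context vectors to class-name token embeddings before a stand-in text encoder. The encoder, dimensions, and names are placeholders for illustration only, not ViFi-CLIP's actual implementation.

```python
import torch
import torch.nn as nn

class PromptedTextEncoder(nn.Module):
    """Toy text branch: learnable prompt tokens + a stand-in transformer encoder."""
    def __init__(self, vocab_size=10000, dim=512, n_ctx=8, n_layers=2):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, dim)
        # n_ctx learnable context vectors shared across all classes.
        self.ctx = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, class_token_ids):
        # class_token_ids: (C, L) token ids for each class-name prompt.
        tok = self.token_embedding(class_token_ids)               # (C, L, D)
        ctx = self.ctx.unsqueeze(0).expand(tok.size(0), -1, -1)   # (C, n_ctx, D)
        x = torch.cat([ctx, tok], dim=1)                          # (C, n_ctx + L, D)
        x = self.encoder(x)
        return x.mean(dim=1)                                      # (C, D) class embeddings

# Hypothetical usage: 3 classes, 4 tokens each; the output embeddings would be
# matched against pooled video features as in the earlier sketch.
enc = PromptedTextEncoder()
class_ids = torch.randint(0, 10000, (3, 4))
text_feats = enc(class_ids)
print(text_feats.shape)  # torch.Size([3, 512])
```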

Maintenance & Community

The project is the official implementation of a CVPR 2023 paper. Contact emails for questions are provided. The code is based on the XCLIP repository.

Licensing & Compatibility

The repository does not explicitly state a license. It is based on XCLIP, which is commonly distributed under an MIT license, but the terms of both should be verified before use.

Limitations & Caveats

The README does not explicitly state the license. The training and evaluation commands assume a distributed setup (e.g., 8 GPUs) and require specific data preparation steps.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 13 stars in the last 90 days
