Research paper exploring fine-tuned CLIP models for video learning
This repository provides the official implementation of "Fine-tuned CLIP models are efficient video learners" (CVPR 2023). It demonstrates that simply fine-tuning pre-trained CLIP models (an approach named ViFi-CLIP) effectively adapts them to the video domain, achieving competitive performance against more complex methods without explicit temporal modeling. The project targets researchers and practitioners in video understanding and multi-modal learning.
How It Works
ViFi-CLIP adapts image-pretrained CLIP models for video tasks by fine-tuning the model on video data. The core idea is that frame-level processing via CLIP's image encoder, followed by feature pooling and similarity matching with text embeddings, implicitly captures temporal cues. This approach avoids the need for complex, dedicated temporal modules, leading to simpler models that are less prone to overfitting and exhibit better generalization. A "bridge and prompt" method is also proposed for low-data regimes, combining fine-tuning with prompt learning.
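The idea can be sketched in a few lines of PyTorch. The snippet below is a minimal illustration of frame-wise CLIP encoding, temporal average pooling, and cosine similarity against text embeddings; it is not the repository's actual code. It assumes the open-source `clip` package, and names such as `video_frames` and `class_prompts` are placeholders.

```python
# Minimal sketch of the ViFi-CLIP idea: encode frames independently with CLIP's
# image encoder, average-pool over time, and match against text embeddings.
# Assumes the openai `clip` package; inputs here are random placeholder frames.
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/16", device=device)
model = model.float()  # keep everything in fp32 for simplicity

# Placeholder for T preprocessed video frames, shape (T, 3, 224, 224)
video_frames = torch.randn(16, 3, 224, 224, device=device)
class_prompts = ["a photo of a person swimming", "a photo of a person running"]

with torch.no_grad():
    # Frame-level processing via CLIP's image encoder
    frame_features = model.encode_image(video_frames)          # (T, D)
    # Temporal average pooling aggregates frames into one video embedding
    video_features = frame_features.mean(dim=0, keepdim=True)  # (1, D)

    # Text embeddings of the class prompts
    text_tokens = clip.tokenize(class_prompts).to(device)
    text_features = model.encode_text(text_tokens)              # (C, D)

    # Cosine similarity between the video embedding and each class embedding
    video_features = video_features / video_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    logits = 100.0 * video_features @ text_features.t()         # (1, C)

print(logits.softmax(dim=-1))
```

During fine-tuning, both encoders are updated on video data with this same frame-pool-and-match objective, which is how temporal cues are absorbed without a dedicated temporal module.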
Quick Start & Requirements
See INSTALL.md for environment setup and DATASETS.md for data preparation.
Training (fully supervised on Kinetics-400, 8 GPUs):
python -m torch.distributed.launch --nproc_per_node=8 main.py -cfg configs/fully_supervised/k400/16_16_vifi_clip.yaml --output /PATH/TO/OUTPUT
Evaluation from a checkpoint:
python -m torch.distributed.launch --nproc_per_node=8 main.py -cfg configs/fully_supervised/k400/16_16_vifi_clip.yaml --output /PATH/TO/OUTPUT --only_test --resume /PATH/TO/CKPT --opts TEST.NUM_CLIP 4 TEST.NUM_CROP 3
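Note: --nproc_per_node sets the number of processes (one per GPU) on the node; it can be lowered to match the available hardware, though batch-size and learning-rate settings in the config may then need adjusting.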
Maintenance & Community
The project is the official implementation of a CVPR 2023 paper. Contact emails for questions are provided. The code is based on the XCLIP repository.
Licensing & Compatibility
The repository does not explicitly state a license. It is based on XCLIP, which is typically MIT licensed, but this should be verified.
Limitations & Caveats
The README does not explicitly state the license. The training and evaluation commands assume a distributed setup (e.g., 8 GPUs) and require specific data preparation steps.