PyTorch Lightning module for CLIP model training and fine-tuning
This repository provides a PyTorch Lightning implementation for training OpenAI's CLIP model from scratch or fine-tuning it. It is designed for researchers and practitioners looking to replicate or adapt CLIP's capabilities for vision-language understanding tasks. The implementation aims for ease of use and fidelity to the original CLIP paper.
How It Works
The project leverages PyTorch Lightning for a structured training pipeline. It supports training CLIP from scratch with a specified model architecture (e.g., ResNet50 or ViT-B/32) and a dataset directory. For data-efficient fine-tuning, it provides a CustomCLIPWrapper that combines a pre-trained image encoder with a Hugging Face text encoder, enabling faster adaptation with less data (see the sketch below). The data loader expects image-caption pairs whose filenames share a stem, with one or more captions per text file separated by newlines.
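As a rough illustration, a fine-tuning setup with CustomCLIPWrapper might look like the following minimal sketch. The import path, constructor keywords (e.g. minibatch_size), projection dimension, and the text-encoder checkpoint name are assumptions for illustration and may differ from the repository's actual API:

```python
# Minimal sketch of data-efficient fine-tuning with CustomCLIPWrapper.
# Constructor arguments and import path are assumptions; check the
# repository's models module for the actual signature.
import torch
import pytorch_lightning as pl
from torchvision.models import resnet50
from transformers import AutoTokenizer, AutoModel

from models import CustomCLIPWrapper  # assumed import path within this repo

# Pre-trained image encoder, re-headed to emit embeddings for the
# contrastive objective (the 768-d projection is an assumption).
img_encoder = resnet50(pretrained=True)
img_encoder.fc = torch.nn.Linear(2048, 768)

# Any Hugging Face text encoder; the checkpoint name is only an example.
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
txt_encoder = AutoModel.from_pretrained("distilbert-base-uncased")

# Wrap both encoders so Lightning handles the contrastive training loop.
model = CustomCLIPWrapper(img_encoder, txt_encoder, minibatch_size=64)

trainer = pl.Trainer(max_epochs=5)
# `train_loader` should yield image/caption batches built from the
# matching-stem pairs described above, tokenized with `tokenizer`.
# trainer.fit(model, train_loader)
```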
Quick Start & Requirements
python train.py --model_name <model_name> --folder <data_dir> --batchsize <batch_size>
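For example, training a ResNet-50 variant from scratch might look like the block below. The model-name string, batch size, and directory contents are illustrative placeholders rather than documented values; the layout simply reflects the matching-stem convention described above:

```
# Example invocation (argument values are placeholders):
python train.py --model_name RN50 --folder data_dir --batchsize 512

# data_dir/ holds image-caption pairs with matching filename stems, e.g.:
#   data_dir/
#   ├── dog_001.jpg
#   ├── dog_001.txt    # one or more captions, one per line
#   ├── cat_042.png
#   └── cat_042.txt
```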
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The project is a direct implementation of the CLIP training script and may not include all the latest optimizations or advanced features like gradient checkpointing or half-precision Adam statistics, which are listed as future work. The lack of an explicit license could pose a barrier for commercial adoption.
Last commit: 3 years ago; the repository appears inactive.