Image captioning model using CLIP embeddings as a prefix
This repository provides an implementation of image captioning with CLIP prefixes. Unlike methods that rely on additional object annotations, the model is trained from images and captions alone, making it suitable for researchers and practitioners who want a faster-to-train alternative to state-of-the-art methods while achieving comparable results on large datasets such as Conceptual Captions.
How It Works
The approach leverages the pre-trained CLIP model's ability to produce semantic image encodings. These encodings are used as a "prefix" for the textual caption: a mapping network (either a simple MLP or a Transformer) learns to translate the raw CLIP encoding into a sequence of prefix embeddings, which is then fed into a language model (GPT-2) to generate the caption. In the MLP variant GPT-2 is fine-tuned together with the mapping network; in the Transformer variant GPT-2 stays frozen and only the mapping network is trained, yielding a lighter model.
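As a rough illustration, a minimal sketch of the MLP mapping idea is shown below. The class name, two-layer architecture, prefix length, and dimensions are assumptions for illustration, not the repository's exact code; the CLIP dimension (512 for ViT-B/32) and GPT-2 hidden size (768) are standard values.

```python
# Minimal sketch, assuming a two-layer MLP mapper; names and sizes are illustrative.
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel


class MLPMapper(nn.Module):
    """Maps one CLIP image embedding to `prefix_length` GPT-2-sized embeddings."""

    def __init__(self, clip_dim=512, prefix_length=10, gpt_dim=768):
        super().__init__()
        self.prefix_length = prefix_length
        self.gpt_dim = gpt_dim
        self.mlp = nn.Sequential(
            nn.Linear(clip_dim, (prefix_length * gpt_dim) // 2),
            nn.Tanh(),
            nn.Linear((prefix_length * gpt_dim) // 2, prefix_length * gpt_dim),
        )

    def forward(self, clip_embedding):
        # (batch, clip_dim) -> (batch, prefix_length, gpt_dim)
        return self.mlp(clip_embedding).view(-1, self.prefix_length, self.gpt_dim)


gpt2 = GPT2LMHeadModel.from_pretrained("gpt2")
mapper = MLPMapper()

clip_embedding = torch.randn(1, 512)        # stand-in for a real CLIP image encoding
caption_ids = torch.tensor([[1212, 318]])   # stand-in for tokenized caption ids

prefix = mapper(clip_embedding)                     # (1, 10, 768) prefix embeddings
token_embeds = gpt2.transformer.wte(caption_ids)    # caption token embeddings
inputs = torch.cat([prefix, token_embeds], dim=1)   # prefix followed by caption
logits = gpt2(inputs_embeds=inputs).logits          # next-token predictions
```

During training, a standard language-modeling loss is applied to the caption positions of these logits; at inference, the caption is decoded autoregressively starting from the prefix alone.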
Quick Start & Requirements
Create the conda environment (conda env create -f environment.yml) and activate it (conda activate clip_prefix_caption).
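Once the environment is active, inference follows the prefix pipeline described above. The sketch below is hypothetical: the image path example.jpg, the 30-token greedy decode, and the MLPMapper class carried over from the earlier sketch are assumptions rather than the repository's actual entry points, and loading of trained mapper weights is omitted.

```python
# Hypothetical inference sketch: encode an image with CLIP, map it to a prefix,
# and greedily decode a caption with GPT-2. MLPMapper comes from the sketch above;
# in practice its trained weights would be loaded from a checkpoint.
import torch
import clip
from PIL import Image
from transformers import GPT2LMHeadModel, GPT2Tokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, preprocess = clip.load("ViT-B/32", device=device)
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
gpt2 = GPT2LMHeadModel.from_pretrained("gpt2").to(device).eval()
mapper = MLPMapper().to(device).eval()  # trained mapping network (weight loading omitted)

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
with torch.no_grad():
    clip_embedding = clip_model.encode_image(image).float()
    embeds = mapper(clip_embedding)            # (1, prefix_length, gpt_dim)
    generated = []
    for _ in range(30):                        # greedy decoding, capped at 30 tokens
        logits = gpt2(inputs_embeds=embeds).logits[:, -1, :]
        next_token = logits.argmax(dim=-1, keepdim=True)
        if next_token.item() == tokenizer.eos_token_id:
            break
        generated.append(next_token.item())
        embeds = torch.cat([embeds, gpt2.transformer.wte(next_token)], dim=1)

print(tokenizer.decode(generated))
```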
Highlighted Details
Maintenance & Community
Last recorded activity was about 1 year ago; the repository appears inactive.
Licensing & Compatibility
Limitations & Caveats