Image captioning model using CLIP and GPT-2
This repository provides a Chinese image captioning model based on the ClipCap architecture, targeting researchers and developers interested in multimodal AI. It aims to bridge the semantic gap between image and text modalities by leveraging CLIP for image encoding and GPT-2 for caption generation, offering two distinct mapping network approaches for improved alignment.
How It Works
The core of ClipCap is an encoder-decoder model. A CLIP model encodes the input image into a `clip_embed` vector. A "Mapping Network" then maps this vector to a sequence of text embeddings (`prefix_embeds`), acting as a bridge between the image and text spaces. Finally, a GPT-2 decoder conditions on these `prefix_embeds` to generate a descriptive caption. The project implements two mapping strategies, an MLP-based network and a Transformer-based network, both designed to align CLIP's image embeddings with GPT-2's text-generation space.
Quick Start & Requirements
1. Install dependencies: `pip install -r requirements.txt`
2. Preprocess the Flickr caption data: `python process_flickr.py`
3. Train (fine-tuning GPT-2): `bash scripts/train_finetune_gpt2.sh`
4. Generate captions: `bash scripts/predict_finetune_gpt2.sh`
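For intuition about what the prediction step does internally, here is a hedged sketch of greedy decoding from the image prefix, reusing the `mapper` and `gpt2` objects from the sketch above; the actual `predict_finetune_gpt2.sh` pipeline and its tokenizer may differ.

```python
# Illustrative greedy decoding from the image prefix (a sketch, not the repo's predict script).
import torch

@torch.no_grad()
def greedy_caption(gpt2, prefix_embeds, eos_id, max_len=30):
    """Generate token ids by repeatedly appending the most likely next token."""
    generated = []
    inputs_embeds = prefix_embeds                                 # (1, prefix_len, gpt_dim)
    for _ in range(max_len):
        logits = gpt2(inputs_embeds=inputs_embeds).logits         # (1, seq_len, vocab)
        next_id = logits[:, -1, :].argmax(dim=-1)                 # pick the top token
        if next_id.item() == eos_id:
            break
        generated.append(next_id.item())
        next_embed = gpt2.transformer.wte(next_id).unsqueeze(1)   # embed the new token
        inputs_embeds = torch.cat([inputs_embeds, next_embed], dim=1)
    return generated  # decode these ids with the tokenizer used at training time

# Example (assuming a tokenizer with an end-of-sequence id):
# token_ids = greedy_caption(gpt2, mapper(clip_embed), eos_id=tokenizer.eos_token_id)
```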
Maintenance & Community
The repository is maintained by yangjianxin1. There are no explicit mentions of community channels or a roadmap in the README.
Licensing & Compatibility
The repository's license is not explicitly stated in the README. It references external projects such as OpenAI's CLIP and rmokady's CLIP_prefix_caption, both of which are released under permissive licenses, but users should verify the license terms of this repository before use.
Limitations & Caveats
The Chinese captions are machine-translated and may contain quality issues or biases, such as the frequent use of quantifiers like "a" or "a group". The GPT-2 model used has 12 layers, far fewer than GPT2-Large (36 layers), which may limit generation quality. The underlying dataset is also noted as potentially biased.