ClipCap-Chinese by yangjianxin1

Image captioning model using CLIP and GPT2

Created 3 years ago · 307 stars

Top 88.4% on sourcepulse

Project Summary

This repository provides a Chinese image captioning model based on the ClipCap architecture, targeting researchers and developers interested in multimodal AI. It aims to bridge the semantic gap between image and text modalities by leveraging CLIP for image encoding and GPT-2 for caption generation, offering two distinct mapping network approaches for improved alignment.

How It Works

The core of ClipCap is an Encoder-Decoder model. A CLIP model encodes input images into a clip_embed vector. This vector is then mapped to a text embedding sequence (prefix_embeds) via a "Mapping Network," which acts as a bridge between the image and text spaces. Finally, a GPT-2 decoder uses these prefix_embeds to generate a descriptive caption. The project implements two mapping strategies: an MLP-based network and a Transformer-based network, both designed to align CLIP's image embeddings with GPT-2's text generation capabilities.
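
Below is a minimal PyTorch sketch of the MLP mapping path described above, assuming illustrative sizes (a 512-d CLIP embedding, 768-d GPT-2 hidden size, prefix length 10); the repository's actual layer sizes and class names may differ.

```python
# Illustrative MLP mapping network: sizes and names are assumptions,
# not the repository's exact configuration.
import torch
import torch.nn as nn


class MLPMapper(nn.Module):
    """Map one CLIP image embedding to a sequence of GPT-2 prefix embeddings."""

    def __init__(self, clip_dim=512, gpt_dim=768, prefix_len=10):
        super().__init__()
        self.prefix_len = prefix_len
        self.gpt_dim = gpt_dim
        hidden = (gpt_dim * prefix_len) // 2
        self.mlp = nn.Sequential(
            nn.Linear(clip_dim, hidden),
            nn.Tanh(),
            nn.Linear(hidden, gpt_dim * prefix_len),
        )

    def forward(self, clip_embed):                      # (batch, clip_dim)
        out = self.mlp(clip_embed)                      # (batch, prefix_len * gpt_dim)
        return out.view(-1, self.prefix_len, self.gpt_dim)


# Usage: during training, prefix_embeds are prepended to the caption token
# embeddings and the concatenated sequence is fed to GPT-2 via inputs_embeds.
clip_embed = torch.randn(1, 512)         # stand-in for a CLIP image embedding
prefix_embeds = MLPMapper()(clip_embed)  # (1, 10, 768)
```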

Quick Start & Requirements

  • Install dependencies: pip install -r requirements.txt
  • Data preprocessing: python process_flickr.py
  • Training MLP+GPT2: bash scripts/train_finetune_gpt2.sh
  • Prediction MLP+GPT2: bash scripts/predict_finetune_gpt2.sh (a rough generation sketch follows this list)
  • Requires Python and the Flickr30k dataset with machine-translated Chinese captions.
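
The prediction step is driven by the shell script above; as a rough, hypothetical illustration of how caption generation from prefix_embeds can work, the sketch below runs a greedy decoding loop against a Hugging Face Chinese GPT-2 checkpoint. The checkpoint name (uer/gpt2-chinese-cluecorpussmall), prefix length, and stopping criterion are assumptions, not the repository's exact setup.

```python
# Hypothetical greedy-decoding sketch: generate a caption from prefix embeddings.
import torch
from transformers import BertTokenizerFast, GPT2LMHeadModel

# Chinese GPT-2 checkpoints from UER ship with a BERT-style tokenizer;
# the checkpoint name below is a placeholder, not necessarily the one used here.
tokenizer = BertTokenizerFast.from_pretrained("uer/gpt2-chinese-cluecorpussmall")
gpt2 = GPT2LMHeadModel.from_pretrained("uer/gpt2-chinese-cluecorpussmall").eval()


@torch.no_grad()
def greedy_caption(prefix_embeds, max_len=30):
    """Greedy decoding seeded with mapper output; prefix_embeds: (1, prefix_len, 768)."""
    generated, inputs, past = [], prefix_embeds, None
    for _ in range(max_len):
        out = gpt2(inputs_embeds=inputs, past_key_values=past, use_cache=True)
        past = out.past_key_values
        next_id = out.logits[:, -1, :].argmax(dim=-1)        # (1,)
        if next_id.item() == tokenizer.sep_token_id:         # stop at [SEP]
            break
        generated.append(next_id.item())
        # Feed the embedding of the newly generated token back in.
        inputs = gpt2.transformer.wte(next_id).unsqueeze(1)  # (1, 1, 768)
    return tokenizer.decode(generated)


prefix_embeds = torch.randn(1, 10, 768)  # stand-in for mapping-network output
print(greedy_caption(prefix_embeds))
```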

Highlighted Details

  • Implements both MLP and Transformer-based mapping networks for CLIP-GPT2 alignment (a Transformer mapper sketch follows this list).
  • Demonstrates experimental results on a Chinese Flickr30k dataset, comparing the two mapping approaches.
  • Provides scripts for data processing, training, and prediction.
  • Discusses training dynamics, noting faster convergence and better performance with the MLP+GPT2 tuning approach.
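
As a complement to the MLP sketch above, here is a minimal, hypothetical version of a Transformer-based mapper in the general ClipCap style: the CLIP embedding is projected to a few tokens, learned prefix constants attend to them through a small Transformer encoder, and the prefix slice becomes prefix_embeds. Dimensions and layer counts are assumptions, not the repository's exact settings.

```python
# Illustrative Transformer-based mapper (dimensions and layer counts are assumptions).
import torch
import torch.nn as nn


class TransformerMapper(nn.Module):
    """Project a CLIP embedding to a few tokens, let learned prefix constants
    attend to them via a small Transformer, and return the prefix slice."""

    def __init__(self, clip_dim=512, gpt_dim=768, clip_len=10, prefix_len=10,
                 num_layers=8, num_heads=8):
        super().__init__()
        self.clip_len = clip_len
        self.proj = nn.Linear(clip_dim, clip_len * gpt_dim)
        self.prefix_const = nn.Parameter(torch.randn(prefix_len, gpt_dim))
        layer = nn.TransformerEncoderLayer(d_model=gpt_dim, nhead=num_heads,
                                           batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, clip_embed):                              # (batch, clip_dim)
        b = clip_embed.size(0)
        x = self.proj(clip_embed).view(b, self.clip_len, -1)    # (b, clip_len, gpt_dim)
        prefix = self.prefix_const.unsqueeze(0).expand(b, -1, -1)
        out = self.transformer(torch.cat([x, prefix], dim=1))
        return out[:, self.clip_len:]                           # (b, prefix_len, gpt_dim)


prefix_embeds = TransformerMapper()(torch.randn(2, 512))        # (2, 10, 768)
```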

Maintenance & Community

The repository is maintained by yangjianxin1. There are no explicit mentions of community channels or a roadmap in the README.

Licensing & Compatibility

The repository's license is not explicitly stated in the README. It references external projects like OpenAI's CLIP and rmokady's CLIP_prefix_caption, which are typically under permissive licenses (e.g., MIT). However, users should verify the specific license terms for this project.

Limitations & Caveats

The Chinese captions are machine-translated and may contain quality issues or biases (e.g., frequent use of quantifiers like "a" or "a group"). The GPT-2 model used has 12 layers, fewer than the original GPT2-Large (36 layers), which might impact generation quality. The dataset itself is noted as potentially biased.

Health Check

  • Last commit: 3 years ago
  • Responsiveness: 1+ week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star history: 9 stars in the last 90 days
