Image captioning model using CLIP and GPT-2
This repository provides a Chinese image captioning model based on the ClipCap architecture, targeting researchers and developers interested in multimodal AI. It aims to bridge the semantic gap between image and text modalities by leveraging CLIP for image encoding and GPT-2 for caption generation, offering two distinct mapping network approaches for improved alignment.
How It Works
The core of ClipCap is an encoder-decoder model. A CLIP model encodes the input image into a `clip_embed` vector. A "Mapping Network" then maps this vector to a sequence of text embeddings (`prefix_embeds`), acting as a bridge between the image and text spaces. Finally, a GPT-2 decoder conditions on these `prefix_embeds` to generate a descriptive caption. The project implements two mapping strategies, an MLP-based network and a Transformer-based network, both designed to align CLIP's image embeddings with GPT-2's text-generation space.
Quick Start & Requirements
1. Install dependencies: `pip install -r requirements.txt`
2. Preprocess the Flickr caption data: `python process_flickr.py`
3. Train (fine-tuning GPT-2): `bash scripts/train_finetune_gpt2.sh`
4. Generate captions: `bash scripts/predict_finetune_gpt2.sh`
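For intuition about what the prediction step does internally, here is a hedged sketch of greedy decoding from the image prefix, reusing the `mapper` and `gpt2` objects from the sketch above; the actual `predict_finetune_gpt2.sh` pipeline and its tokenizer may differ.

```python
# Illustrative greedy decoding from the image prefix (a sketch, not the repo's predict script).
import torch

@torch.no_grad()
def greedy_caption(gpt2, prefix_embeds, eos_id, max_len=30):
    """Generate token ids by repeatedly appending the most likely next token."""
    generated = []
    inputs_embeds = prefix_embeds                                 # (1, prefix_len, gpt_dim)
    for _ in range(max_len):
        logits = gpt2(inputs_embeds=inputs_embeds).logits         # (1, seq_len, vocab)
        next_id = logits[:, -1, :].argmax(dim=-1)                 # pick the top token
        if next_id.item() == eos_id:
            break
        generated.append(next_id.item())
        next_embed = gpt2.transformer.wte(next_id).unsqueeze(1)   # embed the new token
        inputs_embeds = torch.cat([inputs_embeds, next_embed], dim=1)
    return generated  # decode these ids with the tokenizer used at training time

# Example (assuming a tokenizer with an end-of-sequence id):
# token_ids = greedy_caption(gpt2, mapper(clip_embed), eos_id=tokenizer.eos_token_id)
```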
Maintenance & Community
The repository is maintained by yangjianxin1. There are no explicit mentions of community channels or a roadmap in the README.
Licensing & Compatibility
The repository's license is not explicitly stated in the README. It references external projects such as OpenAI's CLIP and rmokady's CLIP_prefix_caption, both of which are released under permissive licenses, but users should verify the license terms of this repository before use.
Limitations & Caveats
The Chinese captions are machine-translated and may contain quality issues or biases, such as the frequent use of quantifiers like "a" or "a group". The GPT-2 model used has 12 layers, far fewer than GPT2-Large (36 layers), which may limit generation quality. The underlying dataset is also noted as potentially biased.