VisualGPT by Vision-CAIR

Image captioning research paper (CVPR 2022)

Created 4 years ago
339 stars

Top 81.3% on SourcePulse

View on GitHub
Project Summary

VisualGPT offers a data-efficient approach to image captioning by adapting a pretrained language model, specifically GPT-2, as the caption decoder. It targets computer vision and natural language processing researchers and practitioners who want to leverage large language models for visual tasks when training data is limited.

How It Works

VisualGPT frames image captioning as a conditional language generation problem. It utilizes a pretrained GPT-2 model, treating it as a decoder that takes visual features as input. The core innovation lies in its data-efficient adaptation strategy, allowing effective fine-tuning of the large language model on image captioning tasks with significantly less data than traditional methods.
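
The minimal sketch below is not the authors' implementation; it only illustrates the general idea of conditioning a pretrained GPT-2 decoder on visual features via cross-attention, using the Hugging Face transformers API. The 2048-dimensional region features, the linear projection, and the example caption are placeholder assumptions.

```python
# Sketch (not the official VisualGPT code): a pretrained GPT-2 decoder that
# attends to visual features through cross-attention layers.
import torch
from transformers import GPT2Config, GPT2LMHeadModel, GPT2Tokenizer

config = GPT2Config.from_pretrained("gpt2", add_cross_attention=True)
model = GPT2LMHeadModel.from_pretrained("gpt2", config=config)
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# Stand-in region features from an object detector (e.g. 50 regions x 2048 dims),
# projected to GPT-2's hidden size so the cross-attention can consume them.
regions = torch.randn(1, 50, 2048)
project = torch.nn.Linear(2048, config.n_embd)
visual_states = project(regions)

# Teacher-forced caption tokens; the language-model head predicts the next word
# while the new cross-attention layers look at the visual states.
inputs = tokenizer("a dog plays in the park", return_tensors="pt")
outputs = model(
    input_ids=inputs["input_ids"],
    encoder_hidden_states=visual_states,
    labels=inputs["input_ids"],
)
print(outputs.loss)  # caption cross-entropy loss used for fine-tuning
```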

Quick Start & Requirements

  • Install: Clone the repository and create a conda environment using environment.yml. Activate the environment with conda activate visualgpt.
  • Prerequisites: Download GPT-2 PyTorch weights, spaCy English data (python -m spacy download en), and COCO dataset annotations and detection features (coco_detections.hdf5); a short sketch for inspecting this file follows the list.
  • Training: python train_visualGPT.py --batch_size 50 --head 12 --tau 0.2 --features_path coco_detections.hdf5 --annotation_folder annotations --lr 1e-4 --gpt_model_type gpt --random_seed 42 --log_file logs/log --exp_name experiment_log --decoder_layer 12 --optimizer_type adamw --gradient_accumulation_steps 2 --train_percentage 0.001 --split_train_data
  • Links: Paper
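
As a quick sanity check on the prerequisites above, the sketch below lists the contents of the precomputed detection features file. The f"{image_id}_features" key layout follows the Meshed-Memory Transformer convention the repo acknowledges; treat it as an assumption and verify against your own download.

```python
# Hedged sketch: inspect the detection features expected by train_visualGPT.py.
import h5py

with h5py.File("coco_detections.hdf5", "r") as f:
    some_key = next(iter(f.keys()))       # e.g. "391895_features" (assumed layout)
    print(some_key, f[some_key].shape)    # typically (num_regions, 2048)
```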

Highlighted Details

  • Data-efficient adaptation of pretrained GPT-2 for image captioning.
  • CVPR 2022 publication.
  • Utilizes COCO dataset for training and evaluation.

Maintenance & Community

The project is associated with the CVPR 2022 conference. No specific community channels or active maintenance indicators are provided in the README.

Licensing & Compatibility

The README does not explicitly state a license. The project acknowledges code from the Meshed Memory Transformer and Transformers repositories, so the licenses of those upstream projects may also apply. Compatibility with commercial use is not specified.

Limitations & Caveats

The example training command uses a train_percentage of 0.001, so it trains on only a tiny fraction of the data and is intended as a low-data demonstration rather than a full training run. The setup also requires downloading GPT-2 weights and COCO detection features, which can be substantial.
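
For a rough sense of scale, assuming the standard Karpathy COCO training split of roughly 113k images (an assumption; the repo's own split may differ), a train_percentage of 0.001 corresponds to only about a hundred images:

```python
# Back-of-the-envelope size of the default demo run (Karpathy split assumed).
train_images = 113_287
train_percentage = 0.001
print(int(train_images * train_percentage))  # ~113 images used for training
```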

Health Check

  • Last Commit: 2 years ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 3 stars in the last 30 days

Explore Similar Projects

Starred by Jason Knight (Director AI Compilers at NVIDIA; Cofounder of OctoML), Travis Fischer (Founder of Agentic), and 5 more.

fromage by kohjingyu

0%
482
Multimodal model for grounding language models to images
Created 2 years ago
Updated 1 year ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Yaowei Zheng (Author of LLaMA-Factory), and 1 more.

CLIP_prefix_caption by rmokady

0.1%
1k
Image captioning model using CLIP embeddings as a prefix
Created 4 years ago
Updated 1 year ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Wing Lian (Founder of Axolotl AI), and 10 more.

open_flamingo by mlfoundations

0.1%
4k
Open-source framework for training large multimodal models
Created 2 years ago
Updated 1 year ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Simon Willison (Coauthor of Django), and 10 more.

LAVIS by salesforce

0.2%
11k
Library for language-vision AI research
Created 3 years ago
Updated 10 months ago