VisualGPT by Vision-CAIR

Image captioning research paper (CVPR 2022)

created 4 years ago
336 stars

Top 83.0% on sourcepulse

Project Summary

VisualGPT offers a data-efficient approach to image captioning by adapting pretrained language models, specifically GPT-2, as a decoder. This method targets researchers and practitioners in computer vision and natural language processing looking to leverage large language models for visual tasks with reduced data requirements.

How It Works

VisualGPT frames image captioning as conditional language generation: a pretrained GPT-2 model serves as the decoder and attends to visual features extracted from the image. The core innovation is a self-resurrecting encoder-decoder attention mechanism that balances attention to the visual input against the model's pretrained linguistic knowledge, which lets the large language model be fine-tuned effectively with significantly less captioning data than traditional methods.
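The gating idea can be sketched in a few lines of PyTorch. The sketch below is illustrative only, not the repository's actual code: the class and gate names are invented here, and it assumes GPT-2-small dimensions (768-dim hidden states, 12 heads) and the tau threshold that appears in the training command.

```python
import torch
import torch.nn as nn

class GatedVisualDecoderLayer(nn.Module):
    """Illustrative sketch: a GPT-2-style block whose visual cross-attention
    is balanced against the pretrained linguistic pathway by a thresholded
    ("self-resurrecting") gate. Names and shapes are assumptions."""

    def __init__(self, d_model=768, n_heads=12, tau=0.2):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gate = nn.Linear(d_model, d_model)
        self.tau = tau  # gate values below this threshold are zeroed out

    def forward(self, text_hidden, visual_feats):
        # Linguistic pathway: self-attention over caption tokens
        # (the causal mask is omitted here for brevity).
        lang, _ = self.self_attn(text_hidden, text_hidden, text_hidden)
        # Visual pathway: cross-attention from tokens to region features.
        vis, _ = self.cross_attn(text_hidden, visual_feats, visual_feats)
        # Complementary gates; truncating values under tau keeps a weak
        # signal from diluting whichever pathway dominates.
        g = torch.sigmoid(self.gate(text_hidden))
        b_vis = g * (g > self.tau)
        b_lan = (1 - g) * ((1 - g) > self.tau)
        return b_vis * vis + b_lan * lang

# Example: 2 captions of 10 tokens attending to 36 region features.
layer = GatedVisualDecoderLayer()
out = layer(torch.randn(2, 10, 768), torch.randn(2, 36, 768))
```

Thresholding both gates at tau (0.2 in the provided training command) is what distinguishes this from a plain sigmoid mixture: near-zero contributions are cut off entirely rather than allowed to accumulate.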

Quick Start & Requirements

  • Install: Clone the repository and create a conda environment using environment.yml. Activate the environment with conda activate visualgpt.
  • Prerequisites: Download GPT-2 PyTorch weights, spaCy English data (python -m spacy download en), and COCO dataset annotations and detections (coco_detections.hdf5). A sanity-check sketch follows this list.
  • Training: python train_visualGPT.py --batch_size 50 --head 12 --tau 0.2 --features_path coco_detections.hdf5 --annotation_folder annotations --lr 1e-4 --gpt_model_type gpt --random_seed 42 --log_file logs/log --exp_name experiment_log --decoder_layer 12 --optimizer_type adamw --gradient_accumulation_steps 2 --train_percentage 0.001 --split_train_data
  • Links: Paper
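Before launching training, it can help to verify that the downloaded prerequisites are readable. The snippet below is a sketch, not part of the repo: the internal key layout of coco_detections.hdf5 is an assumption (the file comes from the Meshed Memory Transformer line of work), and spacy.load("en") matches the old-style download command above (newer spaCy uses "en_core_web_sm").

```python
import h5py
import spacy

# spaCy English data installed via `python -m spacy download en`.
nlp = spacy.load("en")
print(nlp("a man riding a horse"))

# Precomputed COCO detection features; only the keys are listed here,
# so the (large) file is not read into memory.
with h5py.File("coco_detections.hdf5", "r") as f:
    keys = list(f.keys())
    print(f"{len(keys)} entries; first key: {keys[0]}")
```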

Highlighted Details

  • Data-efficient adaptation of pretrained GPT-2 for image captioning.
  • CVPR 2022 publication.
  • Utilizes COCO dataset for training and evaluation.

Maintenance & Community

The project is associated with the CVPR 2022 conference. No specific community channels or active maintenance indicators are provided in the README.

Licensing & Compatibility

The README does not state a license. The project acknowledges code from the Meshed-Memory Transformer and the Transformers library, so their licensing terms may also apply. Commercial-use compatibility is not specified.

Limitations & Caveats

The provided training command sets train_percentage to 0.001, meaning it trains on roughly 0.1% of the training split; this reflects the paper's data-efficient setting rather than a full training run (see the back-of-envelope calculation below). The setup also requires downloading GPT-2 weights and the precomputed COCO detection features, which are substantial downloads.
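For scale, a quick calculation, assuming the standard Karpathy COCO split (113,287 training images); the repo's --split_train_data option may partition the data differently:

```python
# Rough size of the 0.1% subset under the Karpathy-split assumption.
train_images = 113_287
subset = int(train_images * 0.001)  # --train_percentage 0.001
print(subset)  # -> 113 images
```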

Health Check

  • Last commit: 2 years ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 8 stars in the last 90 days
