VisualGPT by Vision-CAIR

Image captioning research paper (CVPR 2022)

Created 4 years ago
339 stars

Top 81.3% on SourcePulse

View on GitHub
Project Summary

VisualGPT offers a data-efficient approach to image captioning by adapting a pretrained language model, specifically GPT-2, as the caption decoder. It targets computer vision and natural language processing researchers and practitioners who want to leverage large language models for visual tasks when training data is limited.

How It Works

VisualGPT frames image captioning as a conditional language generation problem. It utilizes a pretrained GPT-2 model, treating it as a decoder that takes visual features as input. The core innovation lies in its data-efficient adaptation strategy, allowing effective fine-tuning of the large language model on image captioning tasks with significantly less data than traditional methods.
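
The minimal sketch below is not the authors' implementation; it only illustrates the general idea of conditioning a pretrained GPT-2 decoder on visual features via cross-attention, using the Hugging Face transformers API. The 2048-dimensional region features, the linear projection, and the example caption are placeholder assumptions.

```python
# Sketch (not the official VisualGPT code): a pretrained GPT-2 decoder that
# attends to visual features through cross-attention layers.
import torch
from transformers import GPT2Config, GPT2LMHeadModel, GPT2Tokenizer

config = GPT2Config.from_pretrained("gpt2", add_cross_attention=True)
model = GPT2LMHeadModel.from_pretrained("gpt2", config=config)
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# Stand-in region features from an object detector (e.g. 50 regions x 2048 dims),
# projected to GPT-2's hidden size so the cross-attention can consume them.
regions = torch.randn(1, 50, 2048)
project = torch.nn.Linear(2048, config.n_embd)
visual_states = project(regions)

# Teacher-forced caption tokens; the language-model head predicts the next word
# while the new cross-attention layers look at the visual states.
inputs = tokenizer("a dog plays in the park", return_tensors="pt")
outputs = model(
    input_ids=inputs["input_ids"],
    encoder_hidden_states=visual_states,
    labels=inputs["input_ids"],
)
print(outputs.loss)  # caption cross-entropy loss used for fine-tuning
```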

Quick Start & Requirements

  • Install: Clone the repository and create a conda environment using environment.yml. Activate the environment with conda activate visualgpt.
  • Prerequisites: Download GPT-2 PyTorch weights, spaCy English data (python -m spacy download en), and COCO dataset annotations and detection features (coco_detections.hdf5); a short sketch for inspecting this file follows the list.
  • Training: python train_visualGPT.py --batch_size 50 --head 12 --tau 0.2 --features_path coco_detections.hdf5 --annotation_folder annotations --lr 1e-4 --gpt_model_type gpt --random_seed 42 --log_file logs/log --exp_name experiment_log --decoder_layer 12 --optimizer_type adamw --gradient_accumulation_steps 2 --train_percentage 0.001 --split_train_data
  • Links: Paper
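
As a quick sanity check on the prerequisites above, the sketch below lists the contents of the precomputed detection features file. The f"{image_id}_features" key layout follows the Meshed-Memory Transformer convention the repo acknowledges; treat it as an assumption and verify against your own download.

```python
# Hedged sketch: inspect the detection features expected by train_visualGPT.py.
import h5py

with h5py.File("coco_detections.hdf5", "r") as f:
    some_key = next(iter(f.keys()))       # e.g. "391895_features" (assumed layout)
    print(some_key, f[some_key].shape)    # typically (num_regions, 2048)
```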

Highlighted Details

  • Data-efficient adaptation of pretrained GPT-2 for image captioning.
  • CVPR 2022 publication.
  • Utilizes COCO dataset for training and evaluation.

Maintenance & Community

The project is associated with the CVPR 2022 conference. No specific community channels or active maintenance indicators are provided in the README.

Licensing & Compatibility

The README does not explicitly state a license. The project acknowledges code from the Meshed Memory Transformer and Transformers repositories, so the licenses of those upstream projects may also apply. Compatibility with commercial use is not specified.

Limitations & Caveats

The example training command uses a train_percentage of 0.001, so it trains on only a tiny fraction of the data and is intended as a low-data demonstration rather than a full training run. The setup also requires downloading GPT-2 weights and COCO detection features, which can be substantial.
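
For a rough sense of scale, assuming the standard Karpathy COCO training split of roughly 113k images (an assumption; the repo's own split may differ), a train_percentage of 0.001 corresponds to only about a hundred images:

```python
# Back-of-the-envelope size of the default demo run (Karpathy split assumed).
train_images = 113_287
train_percentage = 0.001
print(int(train_images * train_percentage))  # ~113 images used for training
```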

Health Check

  • Last Commit: 2 years ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 3 stars in the last 30 days

Explore Similar Projects

Starred by Jason Knight (Director AI Compilers at NVIDIA; Cofounder of OctoML), Travis Fischer (Founder of Agentic), and 5 more.

fromage by kohjingyu

0%
482
Multimodal model for grounding language models to images
Created 2 years ago
Updated 1 year ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Yaowei Zheng (Author of LLaMA-Factory), and 1 more.

CLIP_prefix_caption by rmokady

0.1%
1k
Image captioning model using CLIP embeddings as a prefix
Created 4 years ago
Updated 1 year ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Wing Lian (Founder of Axolotl AI), and 10 more.

open_flamingo by mlfoundations

0.1%
4k
Open-source framework for training large multimodal models
Created 2 years ago
Updated 1 year ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Simon Willison (Coauthor of Django), and 10 more.

LAVIS by salesforce

0.2%
11k
Library for language-vision AI research
Created 3 years ago
Updated 10 months ago