CLIP_prefix_caption by rmokady

Image captioning model using CLIP embeddings as a prefix

created 3 years ago
1,387 stars

Top 29.7% on sourcepulse

View on GitHub
Project Summary

This repository implements image captioning using CLIP embeddings as a prefix. Unlike methods that rely on object annotations, it trains on images and captions alone, making it suitable for researchers and practitioners who want a faster-to-train alternative to state-of-the-art methods while achieving comparable results on large datasets such as Conceptual Captions.

How It Works

The approach leverages the pre-trained CLIP model to produce a semantic encoding of the input image. A mapping network (either a simple MLP or a Transformer) learns to translate this encoding into a fixed-length "prefix" of embeddings, which is prepended to the caption tokens and fed into a GPT-2 language model that generates the caption. With the MLP mapper, GPT-2 is fine-tuned jointly; the Transformer variant keeps GPT-2 frozen, yielding a lighter model. A minimal sketch of the idea follows.
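The following PyTorch sketch illustrates the prefix-mapping idea under stated assumptions; it is not the repository's actual code. The MLPMapper class, the layer sizes, and the prefix length of 10 are illustrative (CLIP ViT-B/32 produces 512-dimensional image embeddings, and GPT-2's token embeddings are 768-dimensional).

```python
# Minimal sketch of the prefix-mapping idea (illustrative, not the repo's code).
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel, GPT2Tokenizer

class MLPMapper(nn.Module):
    """Maps a CLIP image embedding to prefix_length GPT-2 token embeddings."""
    def __init__(self, clip_dim: int = 512, gpt_dim: int = 768, prefix_length: int = 10):
        super().__init__()
        self.prefix_length = prefix_length
        self.gpt_dim = gpt_dim
        hidden = (clip_dim + gpt_dim * prefix_length) // 2
        self.mlp = nn.Sequential(
            nn.Linear(clip_dim, hidden),
            nn.Tanh(),
            nn.Linear(hidden, gpt_dim * prefix_length),
        )

    def forward(self, clip_embedding: torch.Tensor) -> torch.Tensor:
        # (batch, clip_dim) -> (batch, prefix_length, gpt_dim)
        return self.mlp(clip_embedding).view(-1, self.prefix_length, self.gpt_dim)

# Training-step sketch: prepend the prefix to the caption's token embeddings
# and compute the language-modeling loss on the caption tokens only.
gpt2 = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
mapper = MLPMapper()

clip_embedding = torch.randn(1, 512)  # stand-in for a real CLIP image encoding
caption_ids = tokenizer("A dog runs on the beach.", return_tensors="pt").input_ids
caption_embeds = gpt2.transformer.wte(caption_ids)

prefix = mapper(clip_embedding)                      # (1, 10, 768)
inputs = torch.cat([prefix, caption_embeds], dim=1)  # prefix + caption

# Mask prefix positions with -100 so the loss ignores them.
labels = torch.cat(
    [torch.full((1, mapper.prefix_length), -100, dtype=torch.long), caption_ids], dim=1
)
loss = gpt2(inputs_embeds=inputs, labels=labels).loss  # backprop through mapper (and GPT-2)
```

In the Transformer variant, the mapper alone is trained against a frozen GPT-2, which is why it yields a lighter model to fine-tune and store.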

Quick Start & Requirements

  • Install: Clone the repo, create a conda environment (conda env create -f environment.yml), and activate it (conda activate clip_prefix_caption).
  • Prerequisites: Python, Conda, PyTorch. Training requires downloading the COCO or Conceptual Captions dataset and pre-computing CLIP features; downloading the Conceptual Captions images can take days.
  • Inference: Colab notebooks are provided for inference; a minimal Python sketch follows this list.
  • Demo: Huggingface Spaces demo available.
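For illustration, here is a hedged inference sketch; it reuses the hypothetical MLPMapper class from the sketch above, and the checkpoint path, image path, and greedy decoding loop are assumptions rather than the repository's actual API (the Colab notebooks remain the reference).

```python
# Hedged inference sketch; MLPMapper is the illustrative class defined above,
# and "mapper.pt" / "example.jpg" are hypothetical paths.
import torch
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git
from PIL import Image
from transformers import GPT2LMHeadModel, GPT2Tokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, preprocess = clip.load("ViT-B/32", device=device)
gpt2 = GPT2LMHeadModel.from_pretrained("gpt2").to(device).eval()
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

mapper = MLPMapper().to(device).eval()
# mapper.load_state_dict(torch.load("mapper.pt", map_location=device))  # trained weights

# Encode the image with CLIP (cast to float32 in case of fp16 on GPU).
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
with torch.no_grad():
    clip_embedding = clip_model.encode_image(image).float()
    prefix = mapper(clip_embedding)  # (1, prefix_length, gpt_dim)

    # Greedy decoding: extend the prefix one token embedding at a time.
    embeds, token_ids = prefix, []
    for _ in range(30):
        logits = gpt2(inputs_embeds=embeds).logits[:, -1, :]
        next_id = logits.argmax(dim=-1, keepdim=True)  # (1, 1)
        if next_id.item() == tokenizer.eos_token_id:
            break
        token_ids.append(next_id.item())
        embeds = torch.cat([embeds, gpt2.transformer.wte(next_id)], dim=1)

print(tokenizer.decode(token_ids))
```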

Highlighted Details

  • Achieves results comparable to state-of-the-art on the Conceptual Captions and nocaps datasets.
  • Significantly faster training times compared to similar methods.
  • Offers two mapping network architectures: MLP and Transformer.
  • Transformer variant avoids fine-tuning GPT-2.

Maintenance & Community

The project appears inactive: the last commit was about a year ago, and no pull requests or issues were opened in the past 30 days (see the Health Check section below).

Licensing & Compatibility

  • The repository itself is not explicitly licensed in the README. However, it heavily relies on CLIP (MIT License) and Hugging Face Transformers (Apache 2.0 License). Compatibility for commercial use would depend on the licensing of these underlying dependencies and any specific license chosen for this project's code.

Limitations & Caveats

  • The Huggingface Spaces demo does not support beam search.
  • Downloading and processing the Conceptual Captions dataset can be time-consuming.
  • The licensing of the project's code itself is not clearly stated, which may impact commercial use.
Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 27 stars in the last 90 days
