Image captioning model using CLIP embeddings as a prefix
This repository provides an implementation of image captioning with CLIP prefixes. Unlike methods that rely on additional object annotations, the model is trained from images and captions alone, making it suitable for researchers and practitioners who want a faster-to-train alternative to state-of-the-art methods while achieving comparable results on large datasets such as Conceptual Captions.
How It Works
The approach leverages the pre-trained CLIP model's ability to produce semantic image encodings. These encodings are used as a "prefix" for the textual caption: a mapping network (either a simple MLP or a Transformer) learns to translate the raw CLIP encoding into a sequence of prefix embeddings, which is then fed into a language model (GPT-2) to generate the caption. In the MLP variant GPT-2 is fine-tuned together with the mapping network; in the Transformer variant GPT-2 stays frozen and only the mapping network is trained, yielding a lighter model.
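As a rough illustration, a minimal sketch of the MLP mapping idea is shown below. The class name, two-layer architecture, prefix length, and dimensions are assumptions for illustration, not the repository's exact code; the CLIP dimension (512 for ViT-B/32) and GPT-2 hidden size (768) are standard values.

```python
# Minimal sketch, assuming a two-layer MLP mapper; names and sizes are illustrative.
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel


class MLPMapper(nn.Module):
    """Maps one CLIP image embedding to `prefix_length` GPT-2-sized embeddings."""

    def __init__(self, clip_dim=512, prefix_length=10, gpt_dim=768):
        super().__init__()
        self.prefix_length = prefix_length
        self.gpt_dim = gpt_dim
        self.mlp = nn.Sequential(
            nn.Linear(clip_dim, (prefix_length * gpt_dim) // 2),
            nn.Tanh(),
            nn.Linear((prefix_length * gpt_dim) // 2, prefix_length * gpt_dim),
        )

    def forward(self, clip_embedding):
        # (batch, clip_dim) -> (batch, prefix_length, gpt_dim)
        return self.mlp(clip_embedding).view(-1, self.prefix_length, self.gpt_dim)


gpt2 = GPT2LMHeadModel.from_pretrained("gpt2")
mapper = MLPMapper()

clip_embedding = torch.randn(1, 512)        # stand-in for a real CLIP image encoding
caption_ids = torch.tensor([[1212, 318]])   # stand-in for tokenized caption ids

prefix = mapper(clip_embedding)                     # (1, 10, 768) prefix embeddings
token_embeds = gpt2.transformer.wte(caption_ids)    # caption token embeddings
inputs = torch.cat([prefix, token_embeds], dim=1)   # prefix followed by caption
logits = gpt2(inputs_embeds=inputs).logits          # next-token predictions
```

During training, a standard language-modeling loss is applied to the caption positions of these logits; at inference, the caption is decoded autoregressively starting from the prefix alone.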
Quick Start & Requirements
Create the conda environment (conda env create -f environment.yml) and activate it (conda activate clip_prefix_caption).
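Once the environment is active, inference follows the prefix pipeline described above. The sketch below is hypothetical: the image path example.jpg, the 30-token greedy decode, and the MLPMapper class carried over from the earlier sketch are assumptions rather than the repository's actual entry points, and loading of trained mapper weights is omitted.

```python
# Hypothetical inference sketch: encode an image with CLIP, map it to a prefix,
# and greedily decode a caption with GPT-2. MLPMapper comes from the sketch above;
# in practice its trained weights would be loaded from a checkpoint.
import torch
import clip
from PIL import Image
from transformers import GPT2LMHeadModel, GPT2Tokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, preprocess = clip.load("ViT-B/32", device=device)
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
gpt2 = GPT2LMHeadModel.from_pretrained("gpt2").to(device).eval()
mapper = MLPMapper().to(device).eval()  # trained mapping network (weight loading omitted)

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
with torch.no_grad():
    clip_embedding = clip_model.encode_image(image).float()
    embeds = mapper(clip_embedding)            # (1, prefix_length, gpt_dim)
    generated = []
    for _ in range(30):                        # greedy decoding, capped at 30 tokens
        logits = gpt2(inputs_embeds=embeds).logits[:, -1, :]
        next_token = logits.argmax(dim=-1, keepdim=True)
        if next_token.item() == tokenizer.eos_token_id:
            break
        generated.append(next_token.item())
        embeds = torch.cat([embeds, gpt2.transformer.wte(next_token)], dim=1)

print(tokenizer.decode(generated))
```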
Highlighted Details
Maintenance & Community
Last recorded activity was about 1 year ago; the repository appears inactive.
Licensing & Compatibility
Limitations & Caveats