MAGIC by yxuansu

Framework for image-guided text generation using language models

created 3 years ago
257 stars

Top 98.8% on sourcepulse

Project Summary

MAGIC is a training-free framework that plugs visual controls into text generation, letting language models perform multimodal tasks such as image captioning and visually grounded story generation in a zero-shot manner. It targets researchers and developers who work with large language models and need visual grounding without extensive retraining. The primary benefit is image-guided text generation that decodes much faster than prior state-of-the-art methods (a nearly 27x speedup is reported for zero-shot image captioning).

How It Works

MAGIC combines an off-the-shelf language model (e.g., GPT-2) with an image-text matching model (e.g., CLIP). During decoding, it adds a "magic score" derived from CLIP's embeddings to the language model's own token scores. This score steers the output toward text that is semantically related to a given image while keeping it coherent with the preceding context. Because the approach is plug-and-play and needs no gradient updates or model fine-tuning, it is computationally efficient.
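The decoding idea above can be illustrated with a small sketch. This is not the repository's code, and the exact scoring formula is not given in this summary. The sketch assumes that the language model proposes a few candidate tokens with probabilities, that an image-text model supplies one embedding per candidate continuation plus one for the image, and that the two signals are mixed with a weight called beta. The function names, the beta parameter, and the toy two-dimensional vectors are all hypothetical stand-ins for real GPT-2 and CLIP outputs.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def pick_next_token(candidates, lm_probs, text_embs, image_emb, beta=2.0):
    """Choose one candidate by mixing LM confidence with image relevance.

    candidates: tokens proposed by the language model (assumed top-k list)
    lm_probs:   the language model's probability for each candidate
    text_embs:  hypothetical image-text-model embedding of each continuation
    image_emb:  hypothetical image-text-model embedding of the image
    beta:       assumed weight of the image-relevance term
    """
    # Image relevance: similarity to the image, normalised across candidates.
    sims = [cosine(emb, image_emb) for emb in text_embs]
    relevance = softmax(sims)
    # Combined score: LM probability plus weighted image relevance.
    scores = [p + beta * r for p, r in zip(lm_probs, relevance)]
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best], scores


if __name__ == "__main__":
    candidates = ["dog", "car", "tree"]
    lm_probs = [0.3, 0.5, 0.2]
    # Toy vectors standing in for real embeddings.
    text_embs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
    image_emb = [1.0, 0.1]
    for beta in (0.0, 2.0):
        token, _ = pick_next_token(candidates, lm_probs, text_embs, image_emb, beta)
        print(f"beta={beta}: {token}")
```

With beta set to 0 the toy example follows the language model alone and picks "car"; with beta set to 2 the image term dominates and it picks "dog". No gradients are computed at any point, which is the property that makes this style of control training-free.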

Quick Start & Requirements

Highlighted Details

  • Achieves a nearly 27x decoding speedup for zero-shot image captioning compared with prior state-of-the-art methods.
  • Outperforms existing methods on zero-shot image captioning tasks.
  • Demonstrates capability in visually grounded story generation.
  • Compatible with various text generation tasks requiring image grounding.

Maintenance & Community

  • Publicly released on May 6, 2022.
  • Contact: yxuansu@cam.ac.uk
  • Replicate provides a user-friendly demo.

Licensing & Compatibility

  • The README does not explicitly state a license. Code dependencies may impose their own licenses.

Limitations & Caveats

  • Without a stated license, commercial use and integration into closed-source projects carry legal uncertainty.
  • Requires Python 3.8; the requirements.txt file may also call for specific library versions.

Health Check

  • Last commit: 3 years ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 0

Star History

  • 1 star in the last 90 days
