MAGIC  by yxuansu

Framework for image-guided text generation using language models

Created 3 years ago
259 stars

Top 97.9% on SourcePulse

Project Summary

MAGIC is a training-free framework for integrating visual controls into text generation, enabling language models to perform multimodal tasks like image captioning and visually grounded story generation in a zero-shot manner. It targets researchers and developers working with large language models who need to incorporate visual grounding without extensive retraining. The primary benefit is achieving image-guided text generation with significant speedups over state-of-the-art methods.

How It Works

MAGIC combines an off-the-shelf language model (e.g., GPT-2) with an image-text matching model (e.g., CLIP). During decoding, it adds a "magic score" derived from CLIP's embeddings, which steers the language model toward output that is semantically related to a given image while staying coherent with the preceding context. Because the approach is plug-and-play and requires no gradient updates or model fine-tuning, it is computationally efficient.
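The decoding idea can be illustrated with a minimal sketch. It is not the repository's code: the function name `magic_rerank`, the top-k restriction, the weight `beta`, and the toy numbers are assumptions made for illustration, and the real method may combine further terms. In practice the probabilities would come from a language model such as GPT-2 and the similarities from an image-text model such as CLIP; here both are plain arrays.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    x = np.asarray(x, dtype=float)
    e = np.exp(x - x.max())
    return e / e.sum()


def magic_rerank(lm_probs, image_text_sims, k=3, beta=2.0):
    """Choose the next token id among the language model's top-k candidates.

    lm_probs:        next-token probabilities from the language model, shape (vocab,)
    image_text_sims: hypothetical image-text similarity of the context extended
                     by each token, shape (vocab,); stands in for CLIP scores
    k, beta:         illustrative hyperparameters (assumed, not from the source)
    """
    lm_probs = np.asarray(lm_probs, dtype=float)
    image_text_sims = np.asarray(image_text_sims, dtype=float)

    # Restrict the search to the k tokens the language model finds most likely.
    top_k = np.argsort(lm_probs)[::-1][:k]

    # "Magic score": similarities normalized over the candidate set only.
    magic = softmax(image_text_sims[top_k])

    # Combine language-model confidence with the image-relatedness term.
    scores = lm_probs[top_k] + beta * magic
    return int(top_k[np.argmax(scores)])


# Toy vocabulary of four tokens.
lm_probs = [0.5, 0.3, 0.15, 0.05]
image_text_sims = [0.1, 0.9, 0.2, 5.0]

print(magic_rerank(lm_probs, image_text_sims, k=3, beta=2.0))
print(magic_rerank(lm_probs, image_text_sims, k=3, beta=0.0))
```

With `beta=0.0` the sketch reduces to greedy decoding, while a positive `beta` lets image relatedness reorder the candidates. Token 3 has the highest similarity but is never selected because it falls outside the top-k set, which is how the language model's own ranking keeps the output fluent. No gradients are computed at any point, matching the training-free description above.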

Quick Start & Requirements

Highlighted Details

  • Achieves a nearly 27x decoding speedup for zero-shot image captioning compared to state-of-the-art methods.
  • Outperforms existing methods on zero-shot image captioning tasks.
  • Demonstrates capability in visually grounded story generation.
  • Compatible with various text generation tasks requiring image grounding.

Maintenance & Community

  • Publicly released on May 6, 2022.
  • Contact: yxuansu@cam.ac.uk
  • Replicate provides a user-friendly demo.

Licensing & Compatibility

  • The README does not explicitly state a license. Code dependencies may impose their own licenses.

Limitations & Caveats

  • The repository does not specify a license, which may impact commercial use or integration into closed-source projects.
  • Requires Python 3.8, and the requirements.txt file may tie the project to specific library versions.

Health Check

  • Last commit: 3 years ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 1 star in the last 30 days
