Framework for image-guided text generation using language models
Top 98.8% on sourcepulse
MAGIC is a training-free framework for integrating visual controls into text generation, enabling language models to perform multimodal tasks like image captioning and visually grounded story generation in a zero-shot manner. It targets researchers and developers working with large language models who need to incorporate visual grounding without extensive retraining. The primary benefit is achieving image-guided text generation with significant speedups over state-of-the-art methods.
How It Works
MAGIC combines an off-the-shelf language model (e.g., GPT-2) with an image-text matching model (e.g., CLIP). During decoding, it introduces a "magic score" derived from CLIP's image-text similarity. This score steers the language model's output toward tokens that are semantically related to a given image while maintaining contextual coherence. Because the method requires no gradient updates or model fine-tuning, it works as a plug-and-play addition to decoding and is computationally efficient.
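A minimal, illustrative sketch of the candidate-reranking idea behind the magic score: each candidate next token's language-model probability is boosted by its (softmax-normalized) CLIP image-text similarity. The scoring formula here is simplified and the constants are assumptions for illustration; the actual MAGIC Search objective also includes contrastive search's degeneration penalty.

```python
import math

def magic_rerank(candidates, beta=2.0):
    """Pick the next token by combining LM probability with a magic score.

    candidates: list of (token, lm_prob, clip_similarity) tuples, where
    clip_similarity is the image-text similarity of the continuation.
    beta is an illustrative weight on the visual signal (an assumption,
    not a value from the paper).
    """
    # Softmax-normalize the CLIP similarities over the candidate set.
    exps = [math.exp(sim) for _, _, sim in candidates]
    z = sum(exps)
    magic = [e / z for e in exps]

    # Simplified combined score: LM probability plus weighted magic score.
    scored = [(tok, p + beta * m)
              for (tok, p, _), m in zip(candidates, magic)]
    return max(scored, key=lambda t: t[1])[0]

# Toy example: "dog" is both likely under the LM and matches the image.
cands = [("dog", 0.40, 0.31), ("cat", 0.35, 0.05), ("car", 0.25, -0.10)]
print(magic_rerank(cands))  # -> dog
```

In the real framework these candidates come from the top-k tokens of the language model at each decoding step, and the CLIP similarity is computed between the image and the text continued with each candidate.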
Quick Start & Requirements
pip3 install -r requirements.txt
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats