Multimodal LLM for generating/retrieving images and generating text
This repository provides the code and model weights for GILL, a multimodal language model capable of processing interleaved image and text inputs for text generation, image retrieval, and novel image generation. It is designed for researchers and practitioners in generative AI and multimodal learning.
How It Works
GILL integrates visual and textual modalities by treating images as special tokens within a frozen large language model (LLM). Input images are encoded with a pre-trained visual encoder (e.g., CLIP) and mapped into the LLM's embedding space, while learned [IMG] token embeddings and trained linear mappings project the LLM's hidden states back into a visual embedding space that can drive image retrieval or condition an image generator. This lets the model interleave both modalities flexibly, for example generating text descriptions of images or producing new images from textual prompts.
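The sketch below illustrates this design at a high level. It is a minimal toy example, not the repository's actual API: names like ImagePromptHead, to_image_space, and retrieve are invented for exposition. A small trained head holds learned [IMG] token embeddings and projects the frozen LLM's hidden states into a visual embedding space, which can then be matched against precomputed image embeddings for retrieval or passed to a text-to-image decoder as conditioning.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ImagePromptHead(nn.Module):
    """Toy module: projects frozen-LLM hidden states into an image-embedding space."""
    def __init__(self, llm_hidden_dim: int, image_embed_dim: int, num_img_tokens: int = 8):
        super().__init__()
        # Learned [IMG] token embeddings that are appended to the text prompt.
        self.img_tokens = nn.Parameter(torch.randn(num_img_tokens, llm_hidden_dim))
        # Trained projection from LLM hidden states to the visual embedding space.
        self.to_image_space = nn.Linear(llm_hidden_dim, image_embed_dim)

    def forward(self, img_token_hidden_states: torch.Tensor) -> torch.Tensor:
        # img_token_hidden_states: (batch, num_img_tokens, llm_hidden_dim), taken from
        # the frozen LLM at the positions of the [IMG] tokens.
        pooled = img_token_hidden_states.mean(dim=1)
        return self.to_image_space(pooled)  # (batch, image_embed_dim)

def retrieve(query_embed: torch.Tensor, candidate_embeds: torch.Tensor, k: int = 5) -> torch.Tensor:
    # Cosine-similarity lookup against precomputed visual embeddings
    # (e.g., the Conceptual Captions embeddings provided for retrieval).
    query = F.normalize(query_embed, dim=-1)
    candidates = F.normalize(candidate_embeds, dim=-1)
    return (query @ candidates.T).topk(k, dim=-1).indices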
Quick Start & Requirements
Install the dependencies listed in requirements.txt with pip install -r requirements.txt, then add the gill library to your PYTHONPATH.
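Once the model checkpoints are in place, inference looks roughly like the sketch below. It assumes the loading helper and generation method shown in the repository's demo notebooks (gill.models.load_gill and generate_for_images_and_texts); check those notebooks for the exact checkpoint paths and argument names.

from PIL import Image
from gill import models

# Load the released GILL checkpoint (adjust the path to your download).
model = models.load_gill('checkpoints/gill_opt/')

# An interleaved image-and-text prompt; the model may answer with text,
# a retrieved image, or a newly generated image.
prompt = [
    Image.open('food.jpg').convert('RGB'),
    'What dish could I make with these ingredients? Show me a picture of it.',
]
outputs = model.generate_for_images_and_texts(prompt, num_words=32)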
For image retrieval, download the ~3GB of precomputed visual embeddings for Conceptual Captions.

Highlighted Details
Maintenance & Community
The project is associated with the NeurIPS 2023 paper "Generating Images with Multimodal Language Models" by Koh, Fried, and Salakhutdinov. No specific community channels (Discord/Slack) are mentioned.
Licensing & Compatibility
The repository does not explicitly state a license. The code is provided for research purposes, so commercial use would require clarification from the authors.
Limitations & Caveats
The precomputed embeddings provided for image retrieval cover Conceptual Captions, and retrieval results may differ slightly because some of the original image URLs are no longer available. If you train a GILL model from scratch, you also need to train a new decision classifier, since the provided one is tied to the released model weights.
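For intuition, a decision classifier of this kind can be as simple as a small binary head over the model's projected [IMG] embedding that chooses between retrieving an existing image and generating a new one. The sketch below is illustrative only and is not the classifier shipped with the repository.

import torch
import torch.nn as nn

image_embed_dim = 512  # illustrative; must match the embedding the model actually produces

decision_head = nn.Sequential(
    nn.Linear(image_embed_dim, 256),
    nn.ReLU(),
    nn.Linear(256, 2),  # logits over {retrieve an existing image, generate a new one}
)

img_embed = torch.randn(1, image_embed_dim)       # stand-in for the model's [IMG] embedding
choice = decision_head(img_embed).argmax(dim=-1)  # 0 = retrieve, 1 = generate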