gill by kohjingyu

Multimodal LLM for generating/retrieving images and generating text

created 2 years ago
459 stars

Top 66.9% on sourcepulse

Project Summary

This repository provides the code and model weights for GILL, a multimodal language model capable of processing interleaved image and text inputs for text generation, image retrieval, and novel image generation. It is designed for researchers and practitioners in generative AI and multimodal learning.

How It Works

GILL integrates visual and textual modalities by treating images as special tokens within a large language model (LLM) framework. It leverages pre-trained visual embeddings (e.g., from CLIP) and trains linear layers and image embeddings to condition the LLM's generation process. This approach allows for flexible interaction with both modalities, enabling tasks like generating text descriptions for images or generating new images based on textual prompts.
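The core idea above can be sketched in a few lines: a learned linear projection maps a pre-trained visual embedding into a short sequence of "soft tokens" in the LLM's input embedding space. This is an illustrative sketch only; the dimensions, the number of image tokens, and the variable names are assumptions, not GILL's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Dimensions chosen for illustration only; GILL's actual sizes differ.
clip_dim = 768       # visual embedding size (e.g., a CLIP image encoder)
llm_dim = 4096       # LLM hidden size (hypothetical)
n_img_tokens = 4     # number of learned image tokens (assumed)

# A visual embedding for one image (random stand-in for a CLIP output).
clip_embedding = rng.standard_normal(clip_dim)

# The trained linear layer: maps one visual embedding to a sequence of
# soft tokens living in the LLM's input embedding space.
W = rng.standard_normal((clip_dim, n_img_tokens * llm_dim)) * 0.02
b = np.zeros(n_img_tokens * llm_dim)

soft_tokens = (clip_embedding @ W + b).reshape(n_img_tokens, llm_dim)

# These soft tokens are spliced into the LLM's input sequence in place
# of the image, interleaved with ordinary text token embeddings.
print(soft_tokens.shape)  # (4, 4096)
```

The same pattern runs in reverse for generation: linear layers map the LLM's hidden states for special image tokens into the conditioning space of a retrieval index or an image generator.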

Quick Start & Requirements

  • Install: Set up a virtual environment and run pip install -r requirements.txt. Add the gill library to your PYTHONPATH.
  • Prerequisites: Python, requirements.txt dependencies. For image retrieval, download ~3GB of precomputed visual embeddings for Conceptual Captions.
  • Resources: Training requires significant resources (e.g., 2 A6000 GPUs for 48 hours). Inference is less demanding.
  • Links: Paper, Project Webpage, Gradio Demo
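The install steps above amount to roughly the following; the virtual-environment name and repo layout are assumptions, so adjust paths to your checkout.

```shell
# Sketch of the setup described above (paths are assumptions).
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

# Make the gill package importable from the repo root.
export PYTHONPATH="$PYTHONPATH:$(pwd)"
```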

Highlighted Details

  • Reproduces NeurIPS 2023 paper results for VIST and VisDial benchmarks.
  • Supports image retrieval and novel image generation from interleaved text/image inputs.
  • Includes scripts for training custom GILL models on Conceptual Captions (CC3M).
  • Provides a Gradio demo for local experimentation.

Maintenance & Community

The project is associated with the NeurIPS 2023 paper "Generating Images with Multimodal Language Models" by Koh, Fried, and Salakhutdinov. No specific community channels (Discord/Slack) are mentioned.

Licensing & Compatibility

The repository does not explicitly state a license. The code is provided for research purposes; commercial use would require clarification from the authors.

Limitations & Caveats

The provided precomputed embeddings for image retrieval cover Conceptual Captions, and because some of the original image URLs are no longer accessible, retrieved images may differ slightly from the paper's results. When training a GILL model from scratch, a custom decision classifier must also be trained, since the provided one is tied to the original model weights.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 6 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (author of AI Engineering and Designing Machine Learning Systems), Omar Sanseviero (DevRel at Google DeepMind), and 4 more.

open_flamingo by mlfoundations

Open-source framework for training large multimodal models

  • 0.1%
  • 4k stars
  • created 2 years ago
  • updated 11 months ago