Multimodal LLM for generating/retrieving images and generating text
This repository provides the code and model weights for GILL, a multimodal language model capable of processing interleaved image and text inputs for text generation, image retrieval, and novel image generation. It is designed for researchers and practitioners in generative AI and multimodal learning.
How It Works
GILL integrates visual and textual modalities by treating images as special tokens within a frozen large language model (LLM). Input images are encoded with a pre-trained visual encoder (e.g., CLIP) and mapped into the LLM's embedding space, while learned [IMG] token embeddings and trained linear mappings project the LLM's hidden states back into a visual embedding space that can drive image retrieval or condition an image generator. This lets the model interleave both modalities flexibly, for example generating text descriptions of images or producing new images from textual prompts.
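The sketch below illustrates this design at a high level. It is a minimal toy example, not the repository's actual API: names like ImagePromptHead, to_image_space, and retrieve are invented for exposition. A small trained head holds learned [IMG] token embeddings and projects the frozen LLM's hidden states into a visual embedding space, which can then be matched against precomputed image embeddings for retrieval or passed to a text-to-image decoder as conditioning.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ImagePromptHead(nn.Module):
    """Toy module: projects frozen-LLM hidden states into an image-embedding space."""
    def __init__(self, llm_hidden_dim: int, image_embed_dim: int, num_img_tokens: int = 8):
        super().__init__()
        # Learned [IMG] token embeddings that are appended to the text prompt.
        self.img_tokens = nn.Parameter(torch.randn(num_img_tokens, llm_hidden_dim))
        # Trained projection from LLM hidden states to the visual embedding space.
        self.to_image_space = nn.Linear(llm_hidden_dim, image_embed_dim)

    def forward(self, img_token_hidden_states: torch.Tensor) -> torch.Tensor:
        # img_token_hidden_states: (batch, num_img_tokens, llm_hidden_dim), taken from
        # the frozen LLM at the positions of the [IMG] tokens.
        pooled = img_token_hidden_states.mean(dim=1)
        return self.to_image_space(pooled)  # (batch, image_embed_dim)

def retrieve(query_embed: torch.Tensor, candidate_embeds: torch.Tensor, k: int = 5) -> torch.Tensor:
    # Cosine-similarity lookup against precomputed visual embeddings
    # (e.g., the Conceptual Captions embeddings provided for retrieval).
    query = F.normalize(query_embed, dim=-1)
    candidates = F.normalize(candidate_embeds, dim=-1)
    return (query @ candidates.T).topk(k, dim=-1).indices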
Quick Start & Requirements
Install the dependencies listed in requirements.txt with pip install -r requirements.txt, then add the gill library to your PYTHONPATH.
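Once the model checkpoints are in place, inference looks roughly like the sketch below. It assumes the loading helper and generation method shown in the repository's demo notebooks (gill.models.load_gill and generate_for_images_and_texts); check those notebooks for the exact checkpoint paths and argument names.

from PIL import Image
from gill import models

# Load the released GILL checkpoint (adjust the path to your download).
model = models.load_gill('checkpoints/gill_opt/')

# An interleaved image-and-text prompt; the model may answer with text,
# a retrieved image, or a newly generated image.
prompt = [
    Image.open('food.jpg').convert('RGB'),
    'What dish could I make with these ingredients? Show me a picture of it.',
]
outputs = model.generate_for_images_and_texts(prompt, num_words=32)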
For image retrieval, download the ~3GB of precomputed visual embeddings for Conceptual Captions.

Highlighted Details
Maintenance & Community
The project is associated with the NeurIPS 2023 paper "Generating Images with Multimodal Language Models" by Koh, Fried, and Salakhutdinov. No specific community channels (Discord/Slack) are mentioned.
Licensing & Compatibility
The repository does not explicitly state a license. The code is provided for research purposes, so commercial use would require clarification from the authors.
Limitations & Caveats
The precomputed embeddings provided for image retrieval cover Conceptual Captions, and retrieval results may differ slightly because some of the original image URLs are no longer available. If you train a GILL model from scratch, you also need to train a new decision classifier, since the provided one is tied to the released model weights.
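For intuition, a decision classifier of this kind can be as simple as a small binary head over the model's projected [IMG] embedding that chooses between retrieving an existing image and generating a new one. The sketch below is illustrative only and is not the classifier shipped with the repository.

import torch
import torch.nn as nn

image_embed_dim = 512  # illustrative; must match the embedding the model actually produces

decision_head = nn.Sequential(
    nn.Linear(image_embed_dim, 256),
    nn.ReLU(),
    nn.Linear(256, 2),  # logits over {retrieve an existing image, generate a new one}
)

img_embed = torch.randn(1, image_embed_dim)       # stand-in for the model's [IMG] embedding
choice = decision_head(img_embed).argmax(dim=-1)  # 0 = retrieve, 1 = generate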