gill by kohjingyu

Multimodal LLM for generating/retrieving images and generating text

Created 2 years ago
463 stars

Top 65.5% on SourcePulse

Project Summary

This repository provides the code and model weights for GILL, a multimodal language model capable of processing interleaved image and text inputs for text generation, image retrieval, and novel image generation. It is designed for researchers and practitioners in generative AI and multimodal learning.

How It Works

GILL integrates visual and textual modalities by treating images as special tokens within a large language model (LLM) framework. It leverages pre-trained visual embeddings (e.g., from CLIP) and trains linear layers and image embeddings to condition the LLM's generation process. This approach allows for flexible interaction with both modalities, enabling tasks like generating text descriptions for images or generating new images based on textual prompts.
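The retrieval path described above can be sketched in a few lines: a trained linear layer maps the LLM's hidden state for an image token into the visual embedding space, and retrieval is then a nearest-neighbor lookup over precomputed image embeddings. A minimal numpy sketch (dimensions and the random "trained" projection are illustrative only, not the actual model sizes or weights):

```python
import numpy as np

# Illustrative dimensions (assumptions, far smaller than the real model)
llm_dim, clip_dim = 8, 4

rng = np.random.default_rng(0)
W = rng.standard_normal((llm_dim, clip_dim))  # stands in for the trained linear projection

hidden = rng.standard_normal(llm_dim)  # LLM hidden state at an image token
query = hidden @ W                     # project into the visual embedding space
query /= np.linalg.norm(query)

# Precomputed image embedding bank; index 2 is a copy of the query by construction,
# so we know it should be retrieved
bank = rng.standard_normal((5, clip_dim))
bank[2] = query
bank /= np.linalg.norm(bank, axis=1, keepdims=True)

scores = bank @ query        # cosine similarities
best = int(np.argmax(scores))
print(best)                  # index of the nearest-neighbor image
```

In GILL, the same projected embedding that drives retrieval can instead condition an image generation model, which is how the system decides between retrieving an existing image and generating a new one.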

Quick Start & Requirements

  • Install: Set up a virtual environment and run pip install -r requirements.txt. Add the gill library to your PYTHONPATH.
  • Prerequisites: Python plus the dependencies in requirements.txt. For image retrieval, download ~3GB of precomputed visual embeddings for Conceptual Captions.
  • Resources: Training requires significant resources (e.g., 2 A6000 GPUs for 48 hours). Inference is less demanding.
  • Links: Paper, Project Webpage, Gradio Demo
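Putting the install bullets together, a typical setup might look like the following (the clone URL is inferred from the author and repo names above; the virtual-environment name is an assumption):

```shell
# Clone and enter the repo (URL inferred from author/repo names)
git clone https://github.com/kohjingyu/gill.git
cd gill

# Virtual environment and dependencies
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

# Make the gill library importable, as the README instructs
export PYTHONPATH="$PYTHONPATH:$(pwd)"
```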

Highlighted Details

  • Reproduces NeurIPS 2023 paper results for VIST and VisDial benchmarks.
  • Supports image retrieval and novel image generation from interleaved text/image inputs.
  • Includes scripts for training custom GILL models on Conceptual Captions (CC3M).
  • Provides a Gradio demo for local experimentation.

Maintenance & Community

The project is associated with the NeurIPS 2023 paper "Generating Images with Multimodal Language Models" by Koh, Fried, and Salakhutdinov. No specific community channels (Discord/Slack) are mentioned.

Licensing & Compatibility

The repository does not explicitly state a license. The code is provided for research purposes, and commercial use would require clarification from the authors.

Limitations & Caveats

The provided precomputed embeddings for image retrieval cover Conceptual Captions, and retrieval outputs may differ slightly from the paper because some original image URLs are no longer available. If you train a GILL model from scratch, you must also train a new decision classifier, as the provided one is tied to the original model weights.

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 4 stars in the last 30 days

Explore Similar Projects

Starred by Stas Bekman (Author of "Machine Learning Engineering Open Book"; Research Engineer at Snowflake), Douwe Kiela (Cofounder of Contextual AI), and 1 more.

lens by ContextualAI

0.3% · 353 stars
Vision-language research paper using LLMs
Created 2 years ago · Updated 1 month ago
Starred by Jason Knight (Director AI Compilers at NVIDIA; Cofounder of OctoML), Travis Fischer (Founder of Agentic), and 5 more.

fromage by kohjingyu

0% · 482 stars
Multimodal model for grounding language models to images
Created 2 years ago · Updated 1 year ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems") and Elvis Saravia (Founder of DAIR.AI).

NExT-GPT by NExT-GPT

0.1% · 4k stars
Any-to-any multimodal LLM research paper
Created 2 years ago · Updated 4 months ago