fromage by kohjingyu

Multimodal model for grounding language models to images

created 2 years ago
482 stars

Top 64.5% on sourcepulse

Project Summary

This repository provides code and model weights for FROMAGe, a system that grounds language models to images for multimodal inputs and outputs, as presented in an ICML 2023 paper. It enables text-to-image retrieval and image-conditioned text generation, targeting researchers and practitioners in multimodal AI.

How It Works

FROMAGe integrates visual information into large language models (LLMs) by adding trainable linear layers and a special "[RET]" embedding. This approach allows the LLM to condition its output on image content without requiring extensive retraining of the base LLM. The system leverages precomputed visual embeddings for efficient image retrieval.
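The two mappings described above can be sketched in a few lines. This is a minimal NumPy illustration, not FROMAGe's implementation: the dimensions, layer names, and random weights are all made up, and the real model trains these linear layers jointly with a learned [RET] token inside the LLM.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (for illustration only): visual encoder output,
# LLM hidden size, and a shared retrieval space.
d_vis, d_llm, d_ret = 64, 128, 32

# Stand-ins for the small trainable linear layers; random here.
W_in = rng.normal(size=(d_vis, d_llm)) * 0.01   # image -> LLM input space
W_ret = rng.normal(size=(d_llm, d_ret)) * 0.01  # [RET] hidden state -> retrieval space
W_img = rng.normal(size=(d_vis, d_ret)) * 0.01  # image embedding -> retrieval space

def embed_image_for_llm(img_emb):
    """Project a visual embedding into the LLM's input space,
    so it can be consumed like a token embedding."""
    return img_emb @ W_in

def retrieve(ret_hidden, image_embs, k=3):
    """Rank precomputed image embeddings by cosine similarity to the
    projected [RET] hidden state; return indices of the top-k images."""
    q = ret_hidden @ W_ret
    cands = image_embs @ W_img
    q = q / np.linalg.norm(q)
    cands = cands / np.linalg.norm(cands, axis=1, keepdims=True)
    scores = cands @ q
    return np.argsort(-scores)[:k]

# Toy usage: 100 "precomputed" image embeddings, one [RET] hidden state.
images = rng.normal(size=(100, d_vis))
ret_state = rng.normal(size=(d_llm,))
top = retrieve(ret_state, images)
print(top.shape)  # (3,)
```

Because the image side of the similarity can be computed once and cached, retrieval at inference time reduces to a single matrix-vector product over the precomputed embeddings, which is why the 3GB embedding file above makes retrieval cheap.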

Quick Start & Requirements

  • Install dependencies: pip install -r requirements.txt
  • Add library to PYTHONPATH: export PYTHONPATH=$PYTHONPATH:/path/to/fromage/
  • Precomputed visual embeddings (3GB) are available at the provided URL and should be placed in fromage_model/.
  • Inference examples are available in FROMAGe_example_notebook.ipynb.
  • Training requires the Conceptual Captions dataset formatted as TSV files in the dataset/ directory.
  • An example training command is provided; with a batch size of 180, training converges in about 24 hours on at least one A6000 GPU.

Highlighted Details

  • Supports two model configurations: one with 1 visual token and another (fromage_vis4) with 4 visual tokens for improved dialogue performance.
  • Includes scripts for reproducing paper results on Visual Storytelling and VisDial datasets for both text generation and image retrieval tasks.
  • Offers a Gradio demo for local deployment.
  • Model weights are small (around 11MB) and can be pruned to save disk space.
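The pruning point above follows from the design: since only the added linear layers and the [RET] embedding are trained, a saved checkpoint can be filtered down to just those entries. The sketch below uses hypothetical parameter names, not the repository's actual checkpoint keys.

```python
# Sketch: keep only the trainable additions in a checkpoint dict.
# Key names ("visual_fc.", "ret_emb") are hypothetical stand-ins.

def prune_checkpoint(state_dict, trainable_prefixes=("visual_fc.", "ret_emb")):
    """Return a checkpoint containing only parameters whose names start
    with one of the trainable prefixes. The frozen LLM weights are
    dropped, since they can be reloaded from the base model."""
    return {k: v for k, v in state_dict.items()
            if k.startswith(tuple(trainable_prefixes))}

# Toy checkpoint mixing frozen LLM weights with the small trainable parts.
ckpt = {
    "llm.layers.0.attn.weight": [0.0] * 4,   # frozen, large in practice
    "visual_fc.weight": [0.1, 0.2],          # trainable, small
    "ret_emb": [0.3],                        # trainable, small
}
pruned = prune_checkpoint(ckpt)
print(sorted(pruned))  # ['ret_emb', 'visual_fc.weight']
```

Dropping the frozen base-LLM weights is what makes an ~11MB checkpoint possible despite the full model being orders of magnitude larger.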

Maintenance & Community

  • The project is associated with the ICML 2023 paper "Grounding Language Models to Images for Multimodal Inputs and Outputs."
  • No specific community channels (Discord/Slack) or roadmap are mentioned in the README.

Licensing & Compatibility

  • The README does not explicitly state a license. The code is provided for research purposes related to the ICML 2023 paper.

Limitations & Caveats

  • Some Conceptual Captions (CC3M) images used in the paper's experiments may have become unavailable over time, potentially causing minor differences in reproduced outputs.
  • Users may need to adjust batch sizes, enable gradient accumulation, or disable NCCL P2P (export NCCL_P2P_DISABLE=1) for GPUs with less memory or if encountering issues.
  • Unit tests may silently fail on I/O errors due to exception handling.
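On the batch-size caveat above: gradient accumulation preserves the effective batch size (e.g. the 180 used for training) while only ever holding a smaller micro-batch in memory. A generic NumPy sketch of the idea, using a toy loss rather than the repository's training loop:

```python
import numpy as np

rng = np.random.default_rng(1)
w = rng.normal(size=8)
data = rng.normal(size=(180, 8))  # the "effective" batch of 180 samples

def grad(w, batch):
    # Gradient of mean((x @ w)**2) over the batch with respect to w.
    return ((2 * (batch @ w))[:, None] * batch).mean(axis=0)

full = grad(w, data)             # one gradient over the full batch

micro = 20                       # a micro-batch size that fits in memory
acc = np.zeros_like(w)
for i in range(0, len(data), micro):
    # Weight each micro-batch gradient by its share of the full batch,
    # then accumulate instead of stepping the optimizer.
    acc += grad(w, data[i:i + micro]) * (micro / len(data))

print(np.allclose(acc, full))  # True
```

The accumulated gradient matches the full-batch gradient, so taking one optimizer step per 9 micro-batches of 20 is equivalent to one step on a batch of 180, at a fraction of the peak memory.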
Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

  • 1 star in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (author of AI Engineering and Designing Machine Learning Systems) and Omar Sanseviero (DevRel at Google DeepMind).

gill by kohjingyu

0.2%
459
Multimodal LLM for generating/retrieving images and generating text
created 2 years ago
updated 1 year ago
Starred by Chip Huyen (author of AI Engineering and Designing Machine Learning Systems), Omar Sanseviero (DevRel at Google DeepMind), and 4 more.

open_flamingo by mlfoundations

0.1%
4k
Open-source framework for training large multimodal models
created 2 years ago
updated 11 months ago