fromage by kohjingyu

Multimodal model for grounding language models to images

created 2 years ago
482 stars

Top 64.5% on sourcepulse

Project Summary

This repository provides code and model weights for FROMAGe, a system that grounds language models to images for multimodal inputs and outputs, as presented in an ICML 2023 paper. It enables text-to-image retrieval and image-conditioned text generation, targeting researchers and practitioners in multimodal AI.

How It Works

FROMAGe integrates visual information into large language models (LLMs) by adding trainable linear layers and a special "[RET]" embedding. This approach allows the LLM to condition its output on image content without requiring extensive retraining of the base LLM. The system leverages precomputed visual embeddings for efficient image retrieval.
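The two mappings described above can be sketched in a few lines. This is a minimal NumPy illustration, not FROMAGe's implementation: the dimensions, layer names, and random weights are all made up, and the real model trains these linear layers jointly with a learned [RET] token inside the LLM.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (for illustration only): visual encoder output,
# LLM hidden size, and a shared retrieval space.
d_vis, d_llm, d_ret = 64, 128, 32

# Stand-ins for the small trainable linear layers; random here.
W_in = rng.normal(size=(d_vis, d_llm)) * 0.01   # image -> LLM input space
W_ret = rng.normal(size=(d_llm, d_ret)) * 0.01  # [RET] hidden state -> retrieval space
W_img = rng.normal(size=(d_vis, d_ret)) * 0.01  # image embedding -> retrieval space

def embed_image_for_llm(img_emb):
    """Project a visual embedding into the LLM's input space,
    so it can be consumed like a token embedding."""
    return img_emb @ W_in

def retrieve(ret_hidden, image_embs, k=3):
    """Rank precomputed image embeddings by cosine similarity to the
    projected [RET] hidden state; return indices of the top-k images."""
    q = ret_hidden @ W_ret
    cands = image_embs @ W_img
    q = q / np.linalg.norm(q)
    cands = cands / np.linalg.norm(cands, axis=1, keepdims=True)
    scores = cands @ q
    return np.argsort(-scores)[:k]

# Toy usage: 100 "precomputed" image embeddings, one [RET] hidden state.
images = rng.normal(size=(100, d_vis))
ret_state = rng.normal(size=(d_llm,))
top = retrieve(ret_state, images)
print(top.shape)  # (3,)
```

Because the image side of the similarity can be computed once and cached, retrieval at inference time reduces to a single matrix-vector product over the precomputed embeddings, which is why the 3GB embedding file above makes retrieval cheap.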

Quick Start & Requirements

  • Install dependencies: pip install -r requirements.txt
  • Add library to PYTHONPATH: export PYTHONPATH=$PYTHONPATH:/path/to/fromage/
  • Precomputed visual embeddings (3GB) are available at the provided URL and should be placed in fromage_model/.
  • Inference examples are available in FROMAGe_example_notebook.ipynb.
  • Training requires the Conceptual Captions dataset formatted as TSV files in the dataset/ directory.
  • An example training command is provided; with a batch size of 180, training converges in about 24 hours on at least one A6000 GPU.

Highlighted Details

  • Supports two model configurations: one with 1 visual token and another (fromage_vis4) with 4 visual tokens for improved dialogue performance.
  • Includes scripts for reproducing paper results on Visual Storytelling and VisDial datasets for both text generation and image retrieval tasks.
  • Offers a Gradio demo for local deployment.
  • Model weights are small (around 11MB) and can be pruned to save disk space.
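The pruning point above follows from the design: since only the added linear layers and the [RET] embedding are trained, a saved checkpoint can be filtered down to just those entries. The sketch below uses hypothetical parameter names, not the repository's actual checkpoint keys.

```python
# Sketch: keep only the trainable additions in a checkpoint dict.
# Key names ("visual_fc.", "ret_emb") are hypothetical stand-ins.

def prune_checkpoint(state_dict, trainable_prefixes=("visual_fc.", "ret_emb")):
    """Return a checkpoint containing only parameters whose names start
    with one of the trainable prefixes. The frozen LLM weights are
    dropped, since they can be reloaded from the base model."""
    return {k: v for k, v in state_dict.items()
            if k.startswith(tuple(trainable_prefixes))}

# Toy checkpoint mixing frozen LLM weights with the small trainable parts.
ckpt = {
    "llm.layers.0.attn.weight": [0.0] * 4,   # frozen, large in practice
    "visual_fc.weight": [0.1, 0.2],          # trainable, small
    "ret_emb": [0.3],                        # trainable, small
}
pruned = prune_checkpoint(ckpt)
print(sorted(pruned))  # ['ret_emb', 'visual_fc.weight']
```

Dropping the frozen base-LLM weights is what makes an ~11MB checkpoint possible despite the full model being orders of magnitude larger.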

Maintenance & Community

  • The project is associated with the ICML 2023 paper "Grounding Language Models to Images for Multimodal Inputs and Outputs."
  • No specific community channels (Discord/Slack) or roadmap are mentioned in the README.

Licensing & Compatibility

  • The README does not explicitly state a license. The code is provided for research purposes related to the ICML 2023 paper.

Limitations & Caveats

  • Some Conceptual Captions (CC3M) images used in the paper's experiments may have become unavailable over time, potentially causing minor differences in reproduced outputs.
  • Users may need to adjust batch sizes, enable gradient accumulation, or disable NCCL P2P (export NCCL_P2P_DISABLE=1) for GPUs with less memory or if encountering issues.
  • Unit tests may silently fail on I/O errors due to exception handling.
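On the batch-size caveat above: gradient accumulation preserves the effective batch size (e.g. the 180 used for training) while only ever holding a smaller micro-batch in memory. A generic NumPy sketch of the idea, using a toy loss rather than the repository's training loop:

```python
import numpy as np

rng = np.random.default_rng(1)
w = rng.normal(size=8)
data = rng.normal(size=(180, 8))  # the "effective" batch of 180 samples

def grad(w, batch):
    # Gradient of mean((x @ w)**2) over the batch with respect to w.
    return ((2 * (batch @ w))[:, None] * batch).mean(axis=0)

full = grad(w, data)             # one gradient over the full batch

micro = 20                       # a micro-batch size that fits in memory
acc = np.zeros_like(w)
for i in range(0, len(data), micro):
    # Weight each micro-batch gradient by its share of the full batch,
    # then accumulate instead of stepping the optimizer.
    acc += grad(w, data[i:i + micro]) * (micro / len(data))

print(np.allclose(acc, full))  # True
```

The accumulated gradient matches the full-batch gradient, so taking one optimizer step per 9 micro-batches of 20 is equivalent to one step on a batch of 180, at a fraction of the peak memory.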
Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

  • 1 star in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (author of AI Engineering and Designing Machine Learning Systems) and Omar Sanseviero (DevRel at Google DeepMind).

gill by kohjingyu

0.2%
459
Multimodal LLM for generating/retrieving images and generating text
created 2 years ago
updated 1 year ago
Starred by Chip Huyen (author of AI Engineering and Designing Machine Learning Systems), Omar Sanseviero (DevRel at Google DeepMind), and 4 more.

open_flamingo by mlfoundations

0.1%
4k
Open-source framework for training large multimodal models
created 2 years ago
updated 11 months ago