Multimodal model for grounding language models to images
This repository provides code and model weights for FROMAGe, a system that grounds language models to images for multimodal inputs and outputs, as presented in an ICML 2023 paper. It enables text-to-image retrieval and image-conditioned text generation, targeting researchers and practitioners in multimodal AI.
How It Works
FROMAGe integrates visual information into large language models (LLMs) by adding trainable linear layers and a special "[RET]" embedding. This approach allows the LLM to condition its output on image content without requiring extensive retraining of the base LLM. The system leverages precomputed visual embeddings for efficient image retrieval.
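The retrieval path described above can be illustrated with a small, dependency-free sketch. It is not the repository's code: the function names (`linear`, `cosine`, `retrieve`), the toy weights, the file names, and the use of cosine similarity as the ranking score are all assumptions made for illustration. It shows a linear layer projecting the hidden state at the "[RET]" position into a retrieval space, then ranking precomputed image embeddings against that projection.

```python
import math

def linear(W, b, x):
    """Apply y = W x + b, with W given as a list of rows (hypothetical helper)."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def retrieve(ret_hidden, W, b, image_index, k=1):
    """Project the [RET] hidden state and rank precomputed image embeddings.

    Assumption: ranking by cosine similarity; the source only says that
    precomputed visual embeddings are used for retrieval.
    """
    query = linear(W, b, ret_hidden)
    ranked = sorted(image_index.items(),
                    key=lambda item: cosine(query, item[1]),
                    reverse=True)
    return [name for name, _ in ranked[:k]]

# Toy values only: a 3-d "hidden state" mapped into a 2-d retrieval space.
W_t = [[1.0, 0.0, 0.0],
       [0.0, 1.0, 0.0]]
b_t = [0.0, 0.0]

# Stand-in for precomputed visual embeddings (hypothetical file names).
image_index = {
    "cat.jpg": [1.0, 0.1],
    "dog.jpg": [0.1, 1.0],
    "car.jpg": [-1.0, 0.0],
}

ret_hidden = [0.9, 0.2, 5.0]
top = retrieve(ret_hidden, W_t, b_t, image_index, k=2)
print(top)
```

In this sketch only `W_t` and `b_t` would be learned, which mirrors the idea of leaving the base LLM untouched and training small linear layers on top of it. Because the image embeddings are computed once and stored, a query costs one projection plus a similarity scan.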
Quick Start & Requirements
pip install -r requirements.txt
export PYTHONPATH=$PYTHONPATH:/path/to/fromage/
Paths referenced: fromage_model/, FROMAGe_example_notebook.ipynb, and the dataset/ directory.
Highlighted Details
fromage_vis4: uses 4 visual tokens for improved dialogue performance.
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
A workaround (export NCCL_P2P_DISABLE=1) is noted for GPUs with less memory or if encountering issues.