Research paper for interleaved vision-and-language generation
MiniGPT-5 addresses the challenge of generating interleaved vision-and-language content, enabling coherent multimodal outputs. It is designed for researchers and developers working on advanced multimodal generation tasks, offering a novel approach to bridge text and image generation.
How It Works
The model employs a two-stage training strategy centered on "generative vokens," special tokens that act as a bridge between text and image outputs. This enables description-free multimodal generation: the model does not need explicit image descriptions during training. Classifier-free guidance is integrated to strengthen the vokens' conditioning effect on image generation.
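The repository's actual implementation is not reproduced here, but the sketch below illustrates the general idea under stated assumptions: voken tokens are appended to the LLM vocabulary, their output hidden states are projected by a small feature mapper into the conditioning space of a text-to-image generator, and classifier-free guidance mixes conditional and unconditional denoiser outputs at sampling time. All module names, dimensions, and the guidance formula are illustrative assumptions, not MiniGPT-5's real code.

import torch
import torch.nn as nn

# Illustrative sketch only (not the repository's code): generative vokens are
# extra tokens in the LLM vocabulary; the hidden states produced at their
# positions are mapped into the conditioning space of an image generator.

class VokenFeatureMapper(nn.Module):
    """Projects LLM hidden states at voken positions into image-conditioning embeddings."""
    def __init__(self, llm_dim=4096, cond_dim=768, num_vokens=8):
        super().__init__()
        self.num_vokens = num_vokens
        self.proj = nn.Sequential(
            nn.Linear(llm_dim, cond_dim),
            nn.GELU(),
            nn.Linear(cond_dim, cond_dim),
        )

    def forward(self, hidden_states, voken_mask):
        # hidden_states: (batch, seq_len, llm_dim); voken_mask: (batch, seq_len) bool
        voken_states = hidden_states[voken_mask]              # (batch * num_vokens, llm_dim)
        cond = self.proj(voken_states)                        # (batch * num_vokens, cond_dim)
        return cond.view(-1, self.num_vokens, cond.shape[-1]) # (batch, num_vokens, cond_dim)


def classifier_free_guidance(eps_cond, eps_uncond, scale=7.5):
    """Standard CFG mixing of conditional / unconditional denoiser outputs."""
    return eps_uncond + scale * (eps_cond - eps_uncond)


if __name__ == "__main__":
    batch, seq_len, llm_dim, num_vokens = 2, 32, 4096, 8
    hidden = torch.randn(batch, seq_len, llm_dim)
    mask = torch.zeros(batch, seq_len, dtype=torch.bool)
    mask[:, -num_vokens:] = True                              # vokens occupy the last positions
    cond = VokenFeatureMapper(llm_dim, 768, num_vokens)(hidden, mask)
    print(cond.shape)                                         # torch.Size([2, 8, 768])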
Quick Start & Requirements
Create a conda environment (conda create -n minigpt5 python=3.10), activate it (conda activate minigpt5), and install the requirements (pip install -r requirements.txt). Run the interactive playground with python3 playground.py after setting IS_STAGE2=True and providing paths to the Stage 1 and Stage 2 weights.
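Before launching the demo, a quick environment sanity check can save time. The snippet below is a generic sketch (not part of the repository) that verifies the Python version of the conda environment and that a CUDA-capable PyTorch build is available, which the 7B-scale checkpoints need in practice.

import sys
import torch

# Generic check (not part of MiniGPT-5): confirm the conda environment's Python
# version and whether a CUDA-capable PyTorch build is installed.
assert sys.version_info >= (3, 10), "activate the python=3.10 minigpt5 environment"
print("torch", torch.__version__, "| CUDA available:", torch.cuda.is_available())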
Maintenance & Community
The project is associated with the University of California, Santa Cruz. Citation details are provided in BibTeX format.
Licensing & Compatibility
The repository is released under the MIT License.
Limitations & Caveats
The project requires specific pretrained weights (Vicuna V0 7B, MiniGPT-4 aligned checkpoint) and custom-formatted datasets for full functionality. Evaluation and training scripts are provided, but users must manage dataset downloads and formatting.