MiniGPT-5 by eric-ai-lab

Research paper for interleaved vision-and-language generation

created 1 year ago
861 stars

Top 42.5% on sourcepulse

Project Summary

MiniGPT-5 addresses the challenge of generating interleaved vision-and-language content, enabling coherent multimodal outputs. It is designed for researchers and developers working on advanced multimodal generation tasks, offering a novel approach to bridge text and image generation.

How It Works

The model employs a two-stage training strategy centered around "generative vokens," which act as a bridge for harmonized image-text outputs. This approach facilitates description-free multimodal generation, meaning it doesn't require explicit image descriptions during training. Classifier-free guidance is integrated to enhance the effectiveness of vokens for image generation.
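As a rough illustration (not the repository's actual code), classifier-free guidance combines a conditional and an unconditional model prediction, extrapolating from the unconditional output toward the conditional one by a guidance scale; the sketch below uses toy NumPy arrays in place of real noise predictions:

```python
import numpy as np

def classifier_free_guidance(eps_cond, eps_uncond, guidance_scale):
    """Combine conditional and unconditional predictions.

    Extrapolates from the unconditional prediction toward the
    conditional one; guidance_scale > 1 strengthens conditioning.
    """
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

# Toy example with two-element "noise predictions".
eps_cond = np.array([1.0, 2.0])
eps_uncond = np.array([0.5, 1.0])
guided = classifier_free_guidance(eps_cond, eps_uncond, guidance_scale=3.0)
print(guided)  # [2. 4.]
```

In MiniGPT-5 this kind of guidance is applied during image generation so that the generative vokens condition the output more strongly; the exact predictions being combined depend on the underlying diffusion model.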

Quick Start & Requirements

  • Install: Clone the repository, create a conda environment (conda create -n minigpt5 python=3.10), activate it (conda activate minigpt5), and install requirements (pip install -r requirements.txt).
  • Pretrained Weights: Requires downloading Vicuna V0 7B weights and the MiniGPT-4 aligned checkpoint.
  • MiniGPT-5 Checkpoints: Download Stage 1 (CC3M) and Stage 2 (VIST or MMDialog) checkpoints. Stage 2 requires Stage 1 weights.
  • Demo: Run python3 playground.py after setting IS_STAGE2=True and providing paths to Stage 1 and Stage 2 weights.
  • Datasets: Requires CC3M, VIST, and MMDialog datasets formatted as specified.
  • Hardware: GPU(s) are required for generation and training.
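The setup steps above can be sketched as a single shell session (the repository URL is inferred from the author name, and checkpoint paths are placeholders; check the repo README for the exact download locations and how IS_STAGE2 is set):

```shell
# Clone the repository and create the environment.
git clone https://github.com/eric-ai-lab/MiniGPT-5.git
cd MiniGPT-5
conda create -n minigpt5 python=3.10
conda activate minigpt5
pip install -r requirements.txt

# Download Vicuna V0 7B weights, the MiniGPT-4 aligned checkpoint,
# and the Stage 1 / Stage 2 MiniGPT-5 checkpoints per the README.
# Then set IS_STAGE2=True and the weight paths as described, and run:
python3 playground.py
```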

Highlighted Details

  • Achieves substantial improvements over the baseline Divter model on the MMDialog dataset.
  • Delivers superior or comparable multimodal outputs in human evaluations on the VIST dataset.
  • Supports description-free multimodal generation.
  • Utilizes classifier-free guidance for enhanced image generation.

Maintenance & Community

The project is associated with the University of California, Santa Cruz. Citation details are provided in BibTeX format.

Licensing & Compatibility

The repository is released under the MIT License.

Limitations & Caveats

The project requires specific pretrained weights (Vicuna V0 7B, MiniGPT-4 aligned checkpoint) and custom-formatted datasets for full functionality. Evaluation and training scripts are provided, but users must manage dataset downloads and formatting.

Health Check
Last commit

2 months ago

Responsiveness

1 day

Pull Requests (30d)
0
Issues (30d)
0
Star History
1 star in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Omar Sanseviero (DevRel at Google DeepMind), and 4 more.

open_flamingo by mlfoundations

Top 0.1% on sourcepulse
4k stars
Open-source framework for training large multimodal models
created 2 years ago
updated 11 months ago
Starred by Patrick von Platen (Core Contributor to Hugging Face Transformers and Diffusers), Jiayi Pan (Author of SWE-Gym; AI Researcher at UC Berkeley), and 2 more.

glide-text2im by openai

Top 0.1% on sourcepulse
4k stars
Text-conditional image synthesis model from research paper
created 3 years ago
updated 1 year ago