Lumina-mGPT by Alpha-VLLM

Multimodal autoregressive model for vision and language tasks

Created 1 year ago
622 stars

Top 53.1% on SourcePulse

Project Summary

Lumina-mGPT is a family of autoregressive multimodal models designed for flexible, photorealistic text-to-image generation and other vision-language tasks. It targets researchers and developers working with advanced generative AI, offering a unified framework for diverse multimodal applications.

How It Works

Lumina-mGPT employs a unified autoregressive approach, treating images as sequences of discrete tokens. This lets a single model architecture handle a range of vision-language tasks, including image generation, image understanding, and image-to-image translation. Generated image tokens are converted back to pixels by a VQ-VAE decoder, and the system is built on the xllmx module, an evolution of LLaMA2-Accessory, to support LLM-centered multimodal capabilities.
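
The flow can be pictured with a toy sketch (illustrative only; every name, sentinel, and number below is hypothetical and not the project's API): a text prompt becomes tokens, the transformer appends image tokens one at a time conditioned on everything before them, and a VQ-VAE decoder turns the finished image-token span back into pixels.

```python
# Toy sketch of unified autoregressive multimodal generation.
# All components are stand-ins for illustration, not Lumina-mGPT code.
import random

VOCAB_TEXT = 1000            # hypothetical text-token id range
VOCAB_IMAGE = 8192           # hypothetical VQ-VAE codebook size
IMG_START, IMG_END = -1, -2  # hypothetical sentinels delimiting an image span


def toy_vqvae_decode(image_tokens):
    """Stand-in for the VQ-VAE decoder: map codebook indices back to pixel values."""
    return [t / VOCAB_IMAGE for t in image_tokens]


def toy_next_token(context):
    """Stand-in for the transformer's next-token prediction."""
    return random.randrange(VOCAB_IMAGE)


def generate_image_tokens(prompt_tokens, num_image_tokens=16):
    """Autoregressively extend a text prompt with an image-token span."""
    seq = list(prompt_tokens) + [IMG_START]
    for _ in range(num_image_tokens):
        seq.append(toy_next_token(seq))  # each image token conditions on the full prefix
    seq.append(IMG_END)
    return seq


if __name__ == "__main__":
    prompt = [random.randrange(VOCAB_TEXT) for _ in range(8)]  # "text" prompt
    sequence = generate_image_tokens(prompt)
    image_tokens = sequence[len(prompt) + 1 : -1]
    pixels = toy_vqvae_decode(image_tokens)                    # decoder turns tokens into pixels
    print(f"generated {len(image_tokens)} image tokens -> {len(pixels)} pixel values")
```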

Quick Start & Requirements

  • Installation: Requires manual installation of the xllmx module. Detailed instructions are in INSTALL.md.
  • Prerequisites: Requires manual download of VQ-VAE weights from Meta and placement in lumina_mgpt/ckpts/chameleon/.
  • Demos: Three Gradio demos are available for image generation, image-to-image tasks, and freeform interaction.
    • Image Generation: python -u demos/demo_image_generation.py --pretrained_path Alpha-VLLM/Lumina-mGPT-7B-768 --target_size 768
    • Image2Image: python -u demos/demo_image2image.py --pretrained_path Alpha-VLLM/Lumina-mGPT-7B-768-Omni --target_size 768
    • Freeform: python -u demos/demo_freeform.py --pretrained_path Alpha-VLLM/Lumina-mGPT-7B-768-Omni --target_size 768
  • Inference: Performed with the FlexARInferenceSolver class; the README provides example usage for image generation, image understanding, and omni-potent tasks (see the sketch after this list).
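
For programmatic use, the sketch below follows the pattern of the README's image-generation example with FlexARInferenceSolver (run from inside the lumina_mgpt/ directory). Treat the exact argument names and values here as assumptions to verify against the repository's example code.

```python
# Sketch adapted from the README's image-generation example.
# Argument names/values are assumptions; check them against the repository.
from inference_solver import FlexARInferenceSolver

inference_solver = FlexARInferenceSolver(
    model_path="Alpha-VLLM/Lumina-mGPT-7B-768",  # pick the checkpoint you intend to use
    precision="bf16",
    target_size=768,                             # must match the checkpoint's resolution
)

q1 = (
    "Generate an image of 768x768 according to the following prompt:\n"
    "Image of a dog playing in water, with a waterfall in the background."
)

# generate() returns the text response and a list of generated images
generated = inference_solver.generate(
    images=[],
    qas=[[q1, None]],
    max_gen_len=8192,
    temperature=1.0,
    logits_processor=inference_solver.create_logits_processor(cfg=4.0, image_top_k=2000),
)

answer, new_image = generated[0], generated[1][0]
new_image.save("generated.png")  # PIL.Image output (assumed)
```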

Highlighted Details

  • Supports flexible photorealistic text-to-image generation.
  • Unified framework for image generation, understanding, and image-to-image tasks.
  • Offers 7B and 34B parameter models with varying resolutions (512, 768, 1024).
  • Training code and documentation are released.

Maintenance & Community

  • Project actively releasing code and models.
  • Hiring for research positions; contact gaopengcuhk@gmail.com.

Licensing & Compatibility

  • License not explicitly stated in the README.

Limitations & Caveats

  • The Chameleon implementation in transformers requires manual VQ-VAE weight downloads.
  • Specific target_size arguments must match the checkpoint used.

Health Check

  • Last Commit: 5 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 7 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems") and Omar Sanseviero (DevRel at Google DeepMind).

gill by kohjingyu
Multimodal LLM for generating/retrieving images and generating text
463 stars · Created 2 years ago · Updated 1 year ago