Lumina-mGPT by Alpha-VLLM

Multimodal autoregressive model for vision and language tasks

created 1 year ago
610 stars

Top 54.6% on sourcepulse

Project Summary

Lumina-mGPT is a family of autoregressive multimodal models designed for flexible, photorealistic text-to-image generation and other vision-language tasks. It targets researchers and developers working with advanced generative AI, offering a unified framework for diverse multimodal applications.

How It Works

Lumina-mGPT employs a unified autoregressive approach, treating images as sequences of discrete tokens. This lets a single model architecture handle varied vision-language tasks, including image generation, image understanding, and image-to-image translation. The system uses a VQ-VAE to convert between pixels and token sequences, and is built on the xllmx module, an evolution of LLaMA2-Accessory, to support LLM-centered multimodal capabilities.
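
As a rough illustration of the "images as sequences of tokens" idea, here is a hypothetical sketch; the vocabulary sizes, offsets, and helper names below are invented for illustration and are not Lumina-mGPT's actual API:

    # Hypothetical sketch of a unified text+image token stream; none of
    # these names or sizes come from Lumina-mGPT itself.
    TEXT_VOCAB_SIZE = 32_000       # e.g. a LLaMA-style text vocabulary
    IMAGE_CODEBOOK_SIZE = 8_192    # e.g. number of VQ-VAE codebook entries
    IMAGE_TOKEN_OFFSET = TEXT_VOCAB_SIZE  # image codes live after text IDs

    def image_codes_to_tokens(codes):
        # Map VQ-VAE codebook indices into the shared LM vocabulary.
        return [IMAGE_TOKEN_OFFSET + c for c in codes]

    def tokens_to_image_codes(tokens):
        # Recover codebook indices so a VQ-VAE decoder can render pixels.
        return [t - IMAGE_TOKEN_OFFSET for t in tokens if t >= IMAGE_TOKEN_OFFSET]

    # A text prompt followed by image tokens forms one left-to-right
    # sequence, which is what lets a single autoregressive model both
    # describe and generate images.
    prompt_tokens = [1, 529, 1024]                    # invented text token IDs
    sequence = prompt_tokens + image_codes_to_tokens([5, 77, 4095])
    print(sequence)                        # [1, 529, 1024, 32005, 32077, 36095]
    print(tokens_to_image_codes(sequence))             # [5, 77, 4095]

In this framing, generation is ordinary next-token prediction over the combined vocabulary, and the VQ-VAE decoder turns predicted image tokens back into pixels.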

Quick Start & Requirements

  • Installation: Requires manual installation of the xllmx module. Detailed instructions are in INSTALL.md.
  • Prerequisites: Requires manually downloading the Chameleon VQ-VAE weights from Meta and placing them in lumina_mgpt/ckpts/chameleon/.
  • Demos: Three Gradio demos are available for image generation, image-to-image tasks, and freeform interaction.
    • Image Generation: python -u demos/demo_image_generation.py --pretrained_path Alpha-VLLM/Lumina-mGPT-7B-768 --target_size 768
    • Image2Image: python -u demos/demo_image2image.py --pretrained_path Alpha-VLLM/Lumina-mGPT-7B-768-Omni --target_size 768
    • Freeform: python -u demos/demo_freeform.py --pretrained_path Alpha-VLLM/Lumina-mGPT-7B-768-Omni --target_size 768
  • Inference: Performed via the FlexARInferenceSolver class; example usage is provided for image generation, image understanding, and omni-potent tasks (see the sketch after this list).
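
A minimal image-generation sketch modeled on the README's FlexARInferenceSolver usage; the exact argument names, prompt format, and sampling settings here should be verified against the installed version:

    # Assumes the repo is installed per INSTALL.md and this runs from its
    # lumina_mgpt/ directory, where inference_solver.py lives.
    from inference_solver import FlexARInferenceSolver

    # target_size must match the checkpoint's resolution (see Limitations).
    solver = FlexARInferenceSolver(
        model_path="Alpha-VLLM/Lumina-mGPT-7B-768",
        precision="bf16",
        target_size=768,
    )

    prompt = (
        "Generate an image of 768x768 according to the following prompt:\n"
        "A photo of a red fox standing in fresh snow."
    )

    # qas is a list of [question, answer] pairs; None asks the model to
    # produce the answer. generate() returns the text response together
    # with a list of generated PIL images.
    answer, images = solver.generate(
        images=[],
        qas=[[prompt, None]],
        max_gen_len=8192,
        temperature=1.0,
        logits_processor=solver.create_logits_processor(cfg=4.0, image_top_k=2000),
    )

    images[0].save("generated.png")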

Highlighted Details

  • Supports flexible photorealistic text-to-image generation.
  • Unified framework for image generation, understanding, and image-to-image tasks.
  • Offers 7B- and 34B-parameter models at resolutions of 512, 768, and 1024.
  • Training code and documentation are released.

Maintenance & Community

  • The project is actively releasing code and models.
  • The team is hiring for research positions; contact gaopengcuhk@gmail.com.

Licensing & Compatibility

  • License not explicitly stated in the README.

Limitations & Caveats

  • The Chameleon implementation in transformers does not bundle the VQ-VAE weights, so they must be downloaded manually from Meta.
  • The --target_size argument must match the resolution of the checkpoint being used.
Health Check

  • Last commit: 4 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Stars (90d): +19

Explore Similar Projects

  • open_flamingo by mlfoundations: open-source framework for training large multimodal models. 4k stars, top 0.1% on sourcepulse; created 2 years ago, updated 11 months ago. Starred by Chip Huyen (author of AI Engineering, Designing Machine Learning Systems), Omar Sanseviero (DevRel at Google DeepMind), and 4 more.
  • taming-transformers by CompVis: image synthesis research paper using transformers. 6k stars, top 0.1% on sourcepulse; created 4 years ago, updated 1 year ago. Starred by Andrej Karpathy (founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), Jiayi Pan (author of SWE-Gym; AI researcher at UC Berkeley), and 4 more.