Multimodal autoregressive model for vision and language tasks
Lumina-mGPT is a family of autoregressive multimodal models designed for flexible, photorealistic text-to-image generation and other vision-language tasks. It targets researchers and developers working with advanced generative AI, offering a unified framework for diverse multimodal applications.
How It Works
Lumina-mGPT employs a unified autoregressive approach, treating images as sequences of tokens. This allows it to handle various vision-language tasks, including image generation, image understanding, and image-to-image translation, within a single model architecture. The system leverages a VQ-VAE decoder and is built upon the xllmx module, an evolution of LLaMA2-Accessory, to support LLM-centered multimodal capabilities.
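To make the token-stream formulation concrete, below is a minimal, hypothetical sketch of the generation loop. The names (model, text_tokenizer, vqvae) and the token count are illustrative assumptions, not Lumina-mGPT's actual API; the repository's real entry point is the FlexARInferenceSolver class shown further down.

# Hypothetical sketch of unified autoregressive image generation.
# All object names are illustrative stand-ins, not the repository's API.
import torch

def generate_image(model, text_tokenizer, vqvae, prompt, num_image_tokens=2304):
    # Text and image share one token sequence: encode the prompt first.
    tokens = text_tokenizer.encode(prompt)
    # Sample image tokens one at a time from the same transformer.
    for _ in range(num_image_tokens):
        logits = model(torch.tensor([tokens]))[0, -1]      # next-token logits
        probs = torch.softmax(logits, dim=-1)
        tokens.append(torch.multinomial(probs, 1).item())  # sample next token
    # The VQ-VAE decoder maps the sampled token grid back to pixels.
    # (num_image_tokens=2304 assumes a 48x48 grid for a 768x768 image.)
    image_tokens = torch.tensor(tokens[-num_image_tokens:])
    return vqvae.decode(image_tokens)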
Quick Start & Requirements
Installation requires the xllmx module; detailed instructions are in INSTALL.md. VQ-VAE weights must be downloaded manually and placed under lumina_mgpt/ckpts/chameleon/.

Run the demos:

python -u demos/demo_image_generation.py --pretrained_path Alpha-VLLM/Lumina-mGPT-7B-768 --target_size 768
python -u demos/demo_image2image.py --pretrained_path Alpha-VLLM/Lumina-mGPT-7B-768-Omni --target_size 768
python -u demos/demo_freeform.py --pretrained_path Alpha-VLLM/Lumina-mGPT-7B-768-Omni --target_size 768
Inference is exposed through the FlexARInferenceSolver class, with example usage provided for image generation, image understanding, and omni-potent tasks.
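A sketch of that interface, adapted from the upstream image-generation example; the argument names and values (precision, target_size, cfg, image_top_k) follow the repository's documented usage but should be verified against the current README.

from inference_solver import FlexARInferenceSolver

# Image generation with the 768-resolution checkpoint, adapted from the
# upstream examples -- verify argument names against the current README.
inference_solver = FlexARInferenceSolver(
    model_path="Alpha-VLLM/Lumina-mGPT-7B-768",
    precision="bf16",
    target_size=768,  # must match the checkpoint's resolution
)

prompt = (
    "Generate an image of 768x768 according to the following prompt:\n"
    "Image of a dog playing in water, with a waterfall in the background."
)

# generate() returns (generated text, list of generated PIL images).
generated = inference_solver.generate(
    images=[],
    qas=[[prompt, None]],
    max_gen_len=8192,
    temperature=1.0,
    logits_processor=inference_solver.create_logits_processor(cfg=4.0, image_top_k=2000),
)
answer, image = generated[0], generated[1][0]
image.save("generated.png")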
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The Chameleon implementation in transformers does not include the VQ-VAE decoder, so the setup requires manual VQ-VAE weight downloads. The target_size argument must match the checkpoint used.