Lumina-mGPT by Alpha-VLLM

Multimodal autoregressive model for vision and language tasks

Created 1 year ago
622 stars

Top 53.1% on SourcePulse

Project Summary

Lumina-mGPT is a family of autoregressive multimodal models designed for flexible, photorealistic text-to-image generation and other vision-language tasks. It targets researchers and developers working with advanced generative AI, offering a unified framework for diverse multimodal applications.

How It Works

Lumina-mGPT employs a unified autoregressive approach, treating images as sequences of discrete tokens. This lets a single model architecture handle a range of vision-language tasks, including image generation, image understanding, and image-to-image translation. Generated image tokens are converted back to pixels by a VQ-VAE decoder, and the system is built on the xllmx module, an evolution of LLaMA2-Accessory, to support LLM-centered multimodal capabilities.
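
The flow can be pictured with a toy sketch (illustrative only; every name, sentinel, and number below is hypothetical and not the project's API): a text prompt becomes tokens, the transformer appends image tokens one at a time conditioned on everything before them, and a VQ-VAE decoder turns the finished image-token span back into pixels.

```python
# Toy sketch of unified autoregressive multimodal generation.
# All components are stand-ins for illustration, not Lumina-mGPT code.
import random

VOCAB_TEXT = 1000            # hypothetical text-token id range
VOCAB_IMAGE = 8192           # hypothetical VQ-VAE codebook size
IMG_START, IMG_END = -1, -2  # hypothetical sentinels delimiting an image span


def toy_vqvae_decode(image_tokens):
    """Stand-in for the VQ-VAE decoder: map codebook indices back to pixel values."""
    return [t / VOCAB_IMAGE for t in image_tokens]


def toy_next_token(context):
    """Stand-in for the transformer's next-token prediction."""
    return random.randrange(VOCAB_IMAGE)


def generate_image_tokens(prompt_tokens, num_image_tokens=16):
    """Autoregressively extend a text prompt with an image-token span."""
    seq = list(prompt_tokens) + [IMG_START]
    for _ in range(num_image_tokens):
        seq.append(toy_next_token(seq))  # each image token conditions on the full prefix
    seq.append(IMG_END)
    return seq


if __name__ == "__main__":
    prompt = [random.randrange(VOCAB_TEXT) for _ in range(8)]  # "text" prompt
    sequence = generate_image_tokens(prompt)
    image_tokens = sequence[len(prompt) + 1 : -1]
    pixels = toy_vqvae_decode(image_tokens)                    # decoder turns tokens into pixels
    print(f"generated {len(image_tokens)} image tokens -> {len(pixels)} pixel values")
```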

Quick Start & Requirements

  • Installation: Requires manual installation of the xllmx module. Detailed instructions are in INSTALL.md.
  • Prerequisites: Requires manual download of VQ-VAE weights from Meta and placement in lumina_mgpt/ckpts/chameleon/.
  • Demos: Three Gradio demos are available for image generation, image-to-image tasks, and freeform interaction.
    • Image Generation: python -u demos/demo_image_generation.py --pretrained_path Alpha-VLLM/Lumina-mGPT-7B-768 --target_size 768
    • Image2Image: python -u demos/demo_image2image.py --pretrained_path Alpha-VLLM/Lumina-mGPT-7B-768-Omni --target_size 768
    • Freeform: python -u demos/demo_freeform.py --pretrained_path Alpha-VLLM/Lumina-mGPT-7B-768-Omni --target_size 768
  • Inference: Performed with the FlexARInferenceSolver class; the README provides example usage for image generation, image understanding, and omni-potent tasks (see the sketch after this list).
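
For programmatic use, the sketch below follows the pattern of the README's image-generation example with FlexARInferenceSolver (run from inside the lumina_mgpt/ directory). Treat the exact argument names and values here as assumptions to verify against the repository's example code.

```python
# Sketch adapted from the README's image-generation example.
# Argument names/values are assumptions; check them against the repository.
from inference_solver import FlexARInferenceSolver

inference_solver = FlexARInferenceSolver(
    model_path="Alpha-VLLM/Lumina-mGPT-7B-768",  # pick the checkpoint you intend to use
    precision="bf16",
    target_size=768,                             # must match the checkpoint's resolution
)

q1 = (
    "Generate an image of 768x768 according to the following prompt:\n"
    "Image of a dog playing in water, with a waterfall in the background."
)

# generate() returns the text response and a list of generated images
generated = inference_solver.generate(
    images=[],
    qas=[[q1, None]],
    max_gen_len=8192,
    temperature=1.0,
    logits_processor=inference_solver.create_logits_processor(cfg=4.0, image_top_k=2000),
)

answer, new_image = generated[0], generated[1][0]
new_image.save("generated.png")  # PIL.Image output (assumed)
```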

Highlighted Details

  • Supports flexible photorealistic text-to-image generation.
  • Unified framework for image generation, understanding, and image-to-image tasks.
  • Offers 7B and 34B parameter models with varying resolutions (512, 768, 1024).
  • Training code and documentation are released.

Maintenance & Community

  • Project actively releasing code and models.
  • Hiring for research positions; contact gaopengcuhk@gmail.com.

Licensing & Compatibility

  • License not explicitly stated in the README.

Limitations & Caveats

  • The Chameleon implementation in transformers requires manual VQ-VAE weight downloads.
  • Specific target_size arguments must match the checkpoint used.

Health Check

  • Last Commit: 5 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 7 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems") and Omar Sanseviero (DevRel at Google DeepMind).

gill by kohjingyu
Multimodal LLM for generating/retrieving images and generating text
463 stars · Created 2 years ago · Updated 1 year ago