Framework for text-to-any-modality generation
Lumina-T2X is a unified framework for generating content across multiple modalities (images, video, audio, 3D) from text prompts. It targets researchers and developers looking for a flexible, high-resolution, and multi-duration generative model. The framework leverages a Flow-based Large Diffusion Transformer (Flag-DiT) architecture, enabling it to handle diverse data types and output specifications within a single model.
How It Works
Lumina-T2X utilizes a flow matching formulation with a Diffusion Transformer (DiT) backbone, specifically the Flag-DiT. This approach allows images, videos, 3D data, and audio spectrograms to be encoded uniformly as a single 1D token sequence. By incorporating special tokens like [nextline] and [nextframe], the model achieves resolution and duration extrapolation, generating outputs at resolutions and lengths not seen during training. This unified, flow-based approach aims for faster convergence, stable training, and a simplified generation pipeline.
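To make the sequence construction concrete, here is a minimal sketch of the idea, not the repository's actual code: latent patches are flattened row by row, with [nextline] and [nextframe] embeddings marking layout boundaries, and a flow-matching step regresses the straight-line velocity between noise and data. All shapes, names, and token placements below are illustrative assumptions.

```python
# Sketch of Lumina-style 1D sequence construction plus one flow-matching step.
# Dimensions and special-token handling are assumptions for illustration only.
import torch

dim = 16                          # latent channel dimension (illustrative)
nextline = torch.randn(dim)       # stand-in for a learnable [nextline] embedding
nextframe = torch.randn(dim)      # stand-in for a learnable [nextframe] embedding

def flatten_to_1d(latents: torch.Tensor) -> torch.Tensor:
    """Flatten (frames, height, width, dim) latents into one 1D sequence,
    appending [nextline] after each row and [nextframe] after each frame."""
    tokens = []
    for frame in latents:         # (H, W, dim)
        for row in frame:         # (W, dim)
            tokens.extend(list(row))
            tokens.append(nextline)    # mark the end of a latent row
        tokens.append(nextframe)       # mark the end of a frame
    return torch.stack(tokens)         # (seq_len, dim)

# A 2-frame, 3x4 latent "video" becomes a single token sequence:
x1 = flatten_to_1d(torch.randn(2, 3, 4, dim))  # clean data, flattened
x0 = torch.randn_like(x1)                       # pure noise sample

# Flow matching: interpolate along the straight path and regress its velocity.
t = torch.rand(1)
xt = (1 - t) * x0 + t * x1        # point on the noise-to-data path at time t
target_velocity = x1 - x0         # the quantity the transformer learns to predict
```

Because layout is carried by the special tokens rather than a fixed 2D grid, the same sequence format can in principle describe a taller image or a longer clip at inference time, which is what enables the extrapolation described above.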
Quick Start & Requirements
Install the framework from source: pip install git+https://github.com/Alpha-VLLM/Lumina-T2X
Inference through the diffusers integration requires the development version of diffusers (pip install git+https://github.com/huggingface/diffusers).
Highlighted Details
Integrates with the diffusers library and supports ComfyUI.
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats