Framework for text-to-any-modality generation
Lumina-T2X is a unified framework for generating content across multiple modalities (images, video, audio, 3D) from text prompts. It targets researchers and developers looking for a flexible, high-resolution, and multi-duration generative model. The framework leverages a Flow-based Large Diffusion Transformer (Flag-DiT) architecture, enabling it to handle diverse data types and output specifications within a single model.
How It Works
Lumina-T2X utilizes a flow matching formulation with a Diffusion Transformer (DiT) backbone, specifically the Flag-DiT. This approach allows images, videos, 3D data, and audio spectrograms to be encoded uniformly as a single 1D token sequence. By incorporating special tokens like [nextline] and [nextframe], the model achieves resolution and duration extrapolation, generating outputs at resolutions and lengths not seen during training. This unified, flow-based approach aims for faster convergence, stable training, and a simplified generation pipeline.
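To make the sequence construction concrete, here is a minimal sketch of the idea, not the repository's actual code: latent patches are flattened row by row, with [nextline] and [nextframe] embeddings marking layout boundaries, and a flow-matching step regresses the straight-line velocity between noise and data. All shapes, names, and token placements below are illustrative assumptions.

```python
# Sketch of Lumina-style 1D sequence construction plus one flow-matching step.
# Dimensions and special-token handling are assumptions for illustration only.
import torch

dim = 16                          # latent channel dimension (illustrative)
nextline = torch.randn(dim)       # stand-in for a learnable [nextline] embedding
nextframe = torch.randn(dim)      # stand-in for a learnable [nextframe] embedding

def flatten_to_1d(latents: torch.Tensor) -> torch.Tensor:
    """Flatten (frames, height, width, dim) latents into one 1D sequence,
    appending [nextline] after each row and [nextframe] after each frame."""
    tokens = []
    for frame in latents:         # (H, W, dim)
        for row in frame:         # (W, dim)
            tokens.extend(list(row))
            tokens.append(nextline)    # mark the end of a latent row
        tokens.append(nextframe)       # mark the end of a frame
    return torch.stack(tokens)         # (seq_len, dim)

# A 2-frame, 3x4 latent "video" becomes a single token sequence:
x1 = flatten_to_1d(torch.randn(2, 3, 4, dim))  # clean data, flattened
x0 = torch.randn_like(x1)                       # pure noise sample

# Flow matching: interpolate along the straight path and regress its velocity.
t = torch.rand(1)
xt = (1 - t) * x0 + t * x1        # point on the noise-to-data path at time t
target_velocity = x1 - x0         # the quantity the transformer learns to predict
```

Because layout is carried by the special tokens rather than a fixed 2D grid, the same sequence format can in principle describe a taller image or a longer clip at inference time, which is what enables the extrapolation described above.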
Quick Start & Requirements
Install the framework from source: pip install git+https://github.com/Alpha-VLLM/Lumina-T2X
Inference through the diffusers integration requires the development version of diffusers (pip install git+https://github.com/huggingface/diffusers).
Highlighted Details
Integrates with the diffusers library and supports ComfyUI.
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats