Lumina-T2X by Alpha-VLLM

Framework for text-to-any modality generation

created 1 year ago
2,210 stars

Top 20.9% on sourcepulse

Project Summary

Lumina-T2X is a unified framework for generating content across multiple modalities (images, video, audio, 3D) from text prompts. It targets researchers and developers looking for a flexible, high-resolution, and multi-duration generative model. The framework leverages a Flow-based Large Diffusion Transformer (Flag-DiT) architecture, enabling it to handle diverse data types and output specifications within a single model.

How It Works

Lumina-T2X utilizes a flow matching formulation with a Diffusion Transformer (DiT) backbone, specifically the Flag-DiT. This approach allows for unified encoding of images, videos, 3D data, and audio spectrograms into a single 1D token sequence. By incorporating special tokens like [nextline] and [nextframe], the model achieves resolution and duration extrapolation, generating outputs at resolutions and lengths not seen during training. This unified, flow-based approach aims for faster convergence, stable training, and a simplified generation pipeline.
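The unified 1D tokenization described above can be sketched in a few lines. This is an illustrative toy, not the project's actual code: the separator token names come from the text, but exactly where they are inserted is an assumption.

```python
# Toy sketch of Lumina-T2X-style 1D flattening (assumptions: separators are
# placed *between* rows and *between* frames; real patch tokens are embeddings,
# strings are used here only for readability).

def flatten_to_1d(frames):
    """frames: list of 2D grids (rows of patch tokens) -> one 1D token list."""
    seq = []
    for f, grid in enumerate(frames):
        if f > 0:
            seq.append("[nextframe]")      # separates consecutive frames
        for r, row in enumerate(grid):
            if r > 0:
                seq.append("[nextline]")   # separates rows within a frame
            seq.extend(row)
    return seq

# Two 2x2 frames: a single image would simply omit the [nextframe] token.
frame0 = [["p00", "p01"], ["p10", "p11"]]
frame1 = [["q00", "q01"], ["q10", "q11"]]
tokens = flatten_to_1d([frame0, frame1])
```

Because layout is carried by the separator tokens rather than by a fixed sequence length, a larger grid at inference time just inserts more `[nextline]` tokens, which is what makes resolution and duration extrapolation possible.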

Quick Start & Requirements

  • Installation: pip install git+https://github.com/Alpha-VLLM/Lumina-T2X
  • Diffusers Integration: Requires installing the development version of diffusers (pip install git+https://github.com/huggingface/diffusers).
  • Dependencies: PyTorch, CUDA. Specific model configurations may have varying VRAM requirements.
  • Demos: Interactive demos are available via Hugging Face Spaces and custom web interfaces.
  • Checkpoints: Available on Hugging Face and wisemodel.cn.

Highlighted Details

  • Supports Text-to-Image, Text-to-Video (up to 720P), Text-to-Audio, Text-to-Music, and Text-to-3D/Point Cloud generation.
  • Achieves resolution extrapolation (e.g., 768x768 to 1792x1792) and variable duration generation.
  • Offers compositional generation capabilities for localized control.
  • Integrates with Hugging Face diffusers library and supports ComfyUI.

Maintenance & Community

  • Active development with frequent updates and releases (e.g., Lumina-mGPT, diffusers integration).
  • Core contributors include Dongyang Liu, Le Zhuo, Junlin Xie, Ruoyi Du, and Peng Gao.
  • Community engagement via WeChat groups and Twitter.
  • Papers available on arXiv for Lumina-Next and Lumina-T2X.

Licensing & Compatibility

  • Code License: MIT.
  • Model weights are generally available for research and non-commercial use; the specific license may vary per model checkpoint.

Limitations & Caveats

  • The project is under rapid development, with frequent code updates that may require users to pull the latest code.
  • While the framework is unified, specific modality generation quality and performance may vary.
  • Some advanced features like training scripts for all modalities might still be under development or require specific configurations.
Health Check

  • Last commit: 5 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
Star History

  • 38 stars in the last 90 days

Explore Similar Projects

Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), Georgios Konstantopoulos (CTO, General Partner at Paradigm), and 2 more.

mflux by filipstrand

0.7%
2k
MLX port of FLUX for local image generation on Macs
created 11 months ago
updated 16 hours ago
Starred by Dan Abramov (Core Contributor to React), Patrick von Platen (Core Contributor to Hugging Face Transformers and Diffusers), and 28 more.

stable-diffusion by CompVis

0.1%
71k
Latent text-to-image diffusion model
created 3 years ago
updated 1 year ago