Lumina-T2X by Alpha-VLLM

Framework for text-to-any modality generation

Created 2 years ago
2,253 stars

Top 19.6% on SourcePulse

Project Summary

Lumina-T2X is a unified framework for generating content in multiple modalities (images, video, audio, 3D) from text prompts. It targets researchers and developers who need one flexible generative model that can produce output at varying resolutions and durations. The framework is built on a Flow-based Large Diffusion Transformer (Flag-DiT) architecture, which lets a single model handle diverse data types and output specifications.

How It Works

Lumina-T2X uses a flow matching formulation with a Diffusion Transformer (DiT) backbone, the Flag-DiT. Images, videos, 3D data, and audio spectrograms are all encoded into a single 1D token sequence. Special tokens such as [nextline] and [nextframe] mark row and frame boundaries in that sequence, which lets the model extrapolate to resolutions and durations not seen during training. The unified, flow-based design aims for faster convergence, stable training, and a simplified generation pipeline.
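The 1D token layout described above can be illustrated with a small sketch. It is not the project's code: the helper names (flatten_frames, make_frame), the string placeholders standing in for patch embeddings, and the exact placement of the special tokens (one [nextline] after every row, one [nextframe] after every frame) are assumptions made for illustration.

```python
from typing import List

NEXTLINE = "[nextline]"
NEXTFRAME = "[nextframe]"


def flatten_frames(frames: List[List[List[str]]]) -> List[str]:
    """Flatten frames (each a 2D grid of patch tokens) into one 1D sequence.

    Assumption: [nextline] follows every row and [nextframe] follows every
    frame. The real model's token placement may differ.
    """
    sequence: List[str] = []
    for frame in frames:
        for row in frame:
            sequence.extend(row)
            sequence.append(NEXTLINE)
        sequence.append(NEXTFRAME)
    return sequence


def make_frame(height: int, width: int, frame_index: int = 0) -> List[List[str]]:
    """Build a hypothetical grid of placeholder patch tokens."""
    return [
        [f"f{frame_index}_p{r}_{c}" for c in range(width)]
        for r in range(height)
    ]


# One small "image": 2 rows of 3 patches.
small = flatten_frames([make_frame(2, 3)])

# A larger two-frame "clip": the same scheme, only a longer sequence.
large = flatten_frames([make_frame(4, 6, i) for i in range(2)])

print(len(small), len(large))
```

Because the grid shape is carried by the special tokens rather than by a fixed sequence length, a larger grid or more frames only lengthens the sequence, which is the property the extrapolation claim relies on.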

Quick Start & Requirements

  • Installation: pip install git+https://github.com/Alpha-VLLM/Lumina-T2X
  • Diffusers Integration: Requires installing the development version of diffusers (pip install git+https://github.com/huggingface/diffusers).
  • Dependencies: PyTorch, CUDA. VRAM requirements vary by model configuration.
  • Demos: Interactive demos are available via Hugging Face Spaces and custom web interfaces.
  • Checkpoints: Available on Hugging Face and wisemodel.cn.
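With the development version of diffusers installed, text-to-image inference might look like the sketch below. The pipeline class LuminaText2ImgPipeline and the checkpoint ID Alpha-VLLM/Lumina-Next-SFT-diffusers are not named in this summary; both are assumptions based on the diffusers documentation and should be verified against the installed version. The function name generate_image and the output file name are hypothetical.

```python
# Assumed checkpoint ID; verify it on Hugging Face before use.
CHECKPOINT_ID = "Alpha-VLLM/Lumina-Next-SFT-diffusers"


def generate_image(prompt: str, output_path: str = "lumina_sample.png") -> str:
    """Generate one image from a text prompt and save it to output_path."""
    # Imported lazily so this module loads without the heavy dependencies.
    import torch
    from diffusers import LuminaText2ImgPipeline  # assumed class name

    pipe = LuminaText2ImgPipeline.from_pretrained(
        CHECKPOINT_ID, torch_dtype=torch.bfloat16
    )
    # Move idle submodules to CPU to lower peak VRAM use
    # (this call needs the accelerate package).
    pipe.enable_model_cpu_offload()

    image = pipe(prompt=prompt).images[0]
    image.save(output_path)
    return output_path
```

Calling generate_image("a lighthouse at dusk, oil painting") downloads the checkpoint on first use and needs a CUDA-capable GPU, so it is best tried in an environment where the VRAM requirements of the chosen checkpoint are known.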

Highlighted Details

  • Supports Text-to-Image, Text-to-Video (up to 720P), Text-to-Audio, Text-to-Music, and Text-to-3D/Point Cloud generation.
  • Achieves resolution extrapolation (e.g., 768x768 to 1792x1792) and variable duration generation.
  • Offers compositional generation capabilities for localized control.
  • Integrates with Hugging Face diffusers library and supports ComfyUI.
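To give a sense of what the 768x768 to 1792x1792 extrapolation means in sequence terms, the arithmetic below counts tokens for both sizes. The VAE downsampling factor (8), the patch size (2), and the rule of one [nextline] token per row are assumptions chosen for illustration, not values stated in this summary.

```python
def sequence_length(
    height_px: int,
    width_px: int,
    vae_downsample: int = 8,  # assumed latent downsampling factor
    patch_size: int = 2,      # assumed patch size in latent space
) -> int:
    """Count tokens for one image: patches plus one [nextline] per row."""
    stride = vae_downsample * patch_size
    if height_px % stride or width_px % stride:
        raise ValueError(f"image size must be a multiple of {stride} pixels")
    rows = height_px // stride
    cols = width_px // stride
    return rows * cols + rows


train_len = sequence_length(768, 768)     # 48 * 48 + 48 = 2352
extrap_len = sequence_length(1792, 1792)  # 112 * 112 + 112 = 12656

print(train_len, extrap_len, round(extrap_len / train_len, 2))
```

Under these assumptions the extrapolated image needs a sequence a little over five times longer than the training-size image, so the model must generalize to positions well beyond the lengths it was trained on.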

Maintenance & Community

  • Active development with frequent updates and releases (e.g., Lumina-mGPT, diffusers integration).
  • Core contributors include Dongyang Liu, Le Zhuo, Junlin Xie, Ruoyi Du, and Peng Gao.
  • Community engagement via WeChat groups and Twitter.
  • Papers available on arXiv for Lumina-Next and Lumina-T2X.

Licensing & Compatibility

  • Code License: MIT.
  • Model Weights: Generally released for research and non-commercial use; the license may vary per model checkpoint.

Limitations & Caveats

  • The code changes frequently, so users may need to pull the latest version to stay compatible.
  • Although the framework is unified, generation quality and performance vary by modality.
  • Some advanced features, such as training scripts for all modalities, may still be under development or require specific configurations.

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 2 stars in the last 30 days
