Lumina-T2X by Alpha-VLLM

Framework for text-to-any modality generation

Created 1 year ago
2,221 stars

Top 20.4% on SourcePulse

Project Summary

Lumina-T2X is a unified framework for generating content across multiple modalities (images, video, audio, 3D) from text prompts. It targets researchers and developers looking for a flexible, high-resolution, and multi-duration generative model. The framework leverages a Flow-based Large Diffusion Transformer (Flag-DiT) architecture, enabling it to handle diverse data types and output specifications within a single model.

How It Works

Lumina-T2X utilizes a flow matching formulation with a Diffusion Transformer (DiT) backbone, specifically the Flag-DiT. This approach allows for unified encoding of images, videos, 3D data, and audio spectrograms into a single 1D token sequence. By incorporating special tokens like [nextline] and [nextframe], the model achieves resolution and duration extrapolation, generating outputs at resolutions and lengths not seen during training. This unified, flow-based approach aims for faster convergence, stable training, and a simplified generation pipeline.
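
A minimal sketch of the flattening idea described above. This is illustrative only, not the project's actual code: the token names, toy dimensions, and helper function are assumptions chosen to show how [nextline] / [nextframe] markers turn 2D/3D inputs into one 1D sequence.

```python
# Illustrative sketch (not Lumina-T2X's actual code): flatten spatiotemporal
# patch embeddings into a single 1D token sequence, with learnable
# [nextline] / [nextframe] tokens acting as layout markers.
import torch

dim = 16                                   # token embedding width (toy value)
nextline = torch.randn(1, dim)             # stands in for a learnable [nextline] token
nextframe = torch.randn(1, dim)            # stands in for a learnable [nextframe] token

def flatten_video(latents: torch.Tensor) -> torch.Tensor:
    """latents: (frames, height, width, dim) patch embeddings -> (seq_len, dim)."""
    frames = []
    for frame in latents:                  # iterate over time
        rows = []
        for row in frame:                  # iterate over spatial rows
            rows.append(torch.cat([row, nextline]))   # mark the end of each row
        frames.append(torch.cat(rows + [nextframe]))  # mark the end of each frame
    return torch.cat(frames)

video = torch.randn(2, 3, 4, dim)          # 2 frames of 3x4 patch tokens
seq = flatten_video(video)
# 2 * (3 * (4 + 1) + 1) = 32 tokens; an image is just the single-frame case,
# and larger or longer inputs simply yield longer sequences, which is what
# makes resolution and duration extrapolation possible at inference time.
print(seq.shape)                           # torch.Size([32, 16])
```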

Quick Start & Requirements

  • Installation: pip install git+https://github.com/Alpha-VLLM/Lumina-T2X
  • Diffusers Integration: Requires the development version of diffusers (pip install git+https://github.com/huggingface/diffusers); see the Python sketch after this list.
  • Dependencies: PyTorch and CUDA; VRAM requirements vary by model configuration.
  • Demos: Interactive demos are available via Hugging Face Spaces and custom web interfaces.
  • Checkpoints: Available on Hugging Face and wisemodel.cn.
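
As a quick sanity check of the diffusers route, something like the following should work once the development version is installed. This is a hedged sketch: the pipeline class (LuminaText2ImgPipeline) and checkpoint id (Alpha-VLLM/Lumina-Next-SFT-diffusers) are assumptions based on the integration described above, so consult the Hugging Face model card for the exact names.

```python
# Hedged quick-start sketch for the diffusers integration. The class and
# checkpoint id below are assumptions -- verify them against the model card.
import torch
from diffusers import LuminaText2ImgPipeline

pipe = LuminaText2ImgPipeline.from_pretrained(
    "Alpha-VLLM/Lumina-Next-SFT-diffusers", torch_dtype=torch.bfloat16
)
pipe.to("cuda")                            # a GPU with ample VRAM is assumed

image = pipe(
    prompt="a photo of a red panda reading a book",
    num_inference_steps=30,
).images[0]
image.save("lumina_sample.png")
```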

Highlighted Details

  • Supports Text-to-Image, Text-to-Video (up to 720P), Text-to-Audio, Text-to-Music, and Text-to-3D/Point Cloud generation.
  • Achieves resolution extrapolation (e.g., 768x768 to 1792x1792) and variable-duration generation; see the sketch after this list.
  • Offers compositional generation capabilities for localized control.
  • Integrates with Hugging Face diffusers library and supports ComfyUI.
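
In practice, resolution extrapolation is exercised at inference time by requesting an output larger than the training resolution. A hedged sketch follows, reusing the pipe object from the quick-start example above; whether a given checkpoint stays stable at a given target size is checkpoint-specific.

```python
# Hedged sketch: request an output above the training resolution, reusing
# `pipe` from the quick-start example. Supported sizes are checkpoint-specific.
hires = pipe(
    prompt="an ink painting of a mountain village at dawn",
    height=1792,                           # extrapolated size from the bullet above
    width=1792,
    num_inference_steps=30,
).images[0]
hires.save("lumina_1792.png")
```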

Maintenance & Community

  • Active development with frequent updates and releases (e.g., Lumina-mGPT, diffusers integration).
  • Core contributors include Dongyang Liu, Le Zhuo, Junlin Xie, Ruoyi Du, and Peng Gao.
  • Community engagement via WeChat groups and Twitter.
  • Papers available on arXiv for Lumina-Next and Lumina-T2X.

Licensing & Compatibility

  • Code License: MIT.
  • Model weights are generally released for research and non-commercial use; licenses may vary per checkpoint.

Limitations & Caveats

  • The project is under rapid development; frequent updates may require users to pull the latest code.
  • While the framework is unified, generation quality and performance vary across modalities.
  • Some advanced features, such as training scripts for all modalities, may still be under development or require specific configurations.
Health Check

  • Last Commit: 7 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 5 stars in the last 30 days

Explore Similar Projects

Starred by Patrick von Platen (Author of Hugging Face Diffusers; Research Engineer at Mistral) and Omar Sanseviero (DevRel at Google DeepMind).

AudioLDM by haoheliu

Top 0.1% · 3k stars
Audio generation research paper using latent diffusion
Created 2 years ago · Updated 2 months ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Chaoyu Yang (Founder of Bento), and 11 more.

IF by deep-floyd

Top 0.0% · 8k stars
Text-to-image model for photorealistic synthesis and language understanding
Created 2 years ago · Updated 1 year ago