unidiffuser by thu-ml

Unified diffusion framework for multi-modal generation

created 2 years ago
1,430 stars

Top 29.1% on sourcepulse

View on GitHub
Project Summary

UniDiffuser is a unified diffusion framework designed to handle multiple data modalities (image, text) within a single model. It addresses the challenge of training separate diffusion models for marginal, conditional, and joint distributions by unifying them as a single noise prediction task. This approach benefits researchers and practitioners working with multi-modal generative AI who seek a versatile and efficient solution.

How It Works

UniDiffuser employs a Transformer-based architecture (U-ViT) to parameterize the diffusion model. The core innovation lies in perturbing data across all modalities simultaneously and inputting modality-specific timesteps. The model then predicts the noise for all perturbed modalities. This unified approach, leveraging a shared Transformer backbone, allows for efficient simultaneous learning of image, text, text-to-image, image-to-text, and joint image-text generation without requiring separate models or significant architectural modifications.
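
Below is a minimal PyTorch sketch of that unified objective, not the project's actual U-ViT code: the joint noise-prediction network interface (`noise_pred_net`) and the tensor shapes are illustrative assumptions.

```python
import torch

def unidiffuser_training_step(noise_pred_net, x_img, x_txt, alphas_cumprod):
    """Illustrative sketch of the unified objective: perturb both modalities
    with independent timesteps and predict the noise of both at once."""
    B = x_img.shape[0]

    # Modality-specific timesteps; at sampling time, fixing one modality's
    # timestep to 0 (keeping it clean) recovers conditional generation.
    t_img = torch.randint(0, len(alphas_cumprod), (B,))
    t_txt = torch.randint(0, len(alphas_cumprod), (B,))

    # Perturb the image latent and the text embedding simultaneously.
    eps_img, eps_txt = torch.randn_like(x_img), torch.randn_like(x_txt)
    a_img = alphas_cumprod[t_img].view(B, 1, 1, 1)   # x_img: (B, C, H, W)
    a_txt = alphas_cumprod[t_txt].view(B, 1)         # x_txt: (B, D)
    z_img = a_img.sqrt() * x_img + (1 - a_img).sqrt() * eps_img
    z_txt = a_txt.sqrt() * x_txt + (1 - a_txt).sqrt() * eps_txt

    # One shared transformer predicts the noise for both modalities,
    # conditioned on both timesteps (assumed interface).
    pred_img, pred_txt = noise_pred_net(z_img, z_txt, t_img, t_txt)

    # Single unified noise-prediction loss over all modalities.
    return ((pred_img - eps_img) ** 2).mean() + ((pred_txt - eps_txt) ** 2).mean()
```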

Quick Start & Requirements

  • Install: Use conda to create an environment and pip to install dependencies. Key packages include torch, accelerate, transformers, clip, and optionally xformers and triton for performance.
  • Prerequisites: Python 3.9, CUDA 11.6 (for PyTorch), and a GPU with at least 10GB VRAM are recommended.
  • Pretrained Models: Download autoencoder_kl.pth, caption_decoder.pth, and uvit_v0.pth or uvit_v1.pth from Hugging Face and place them in a models directory (see the download sketch after this list).
  • Inference: Run sample_multi_v1.py (or sample_multi_v0.py) with specified modes like t2i, i2t, joint, etc.
  • Diffusers Integration: Available via UniDiffuserPipeline in the diffusers library (see the pipeline sketch after this list).
  • Documentation: Official UniDiffuser documentation
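
A hedged sketch of fetching the checkpoints listed above with huggingface_hub; the repo id thu-ml/unidiffuser-v1 and the exact filenames are assumptions to verify against the project README.

```python
from pathlib import Path
from huggingface_hub import hf_hub_download

models_dir = Path("models")
models_dir.mkdir(exist_ok=True)

# Checkpoint names as listed above; the hosting repo id is an assumption.
for filename in ["autoencoder_kl.pth", "caption_decoder.pth", "uvit_v1.pth"]:
    hf_hub_download(
        repo_id="thu-ml/unidiffuser-v1",  # assumed Hugging Face repo
        filename=filename,
        local_dir=models_dir,
    )
```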
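And a sketch of the diffusers integration via UniDiffuserPipeline; argument names and defaults may differ between diffusers versions, so treat this as a starting point rather than canonical usage.

```python
import torch
from diffusers import UniDiffuserPipeline

pipe = UniDiffuserPipeline.from_pretrained(
    "thu-ml/unidiffuser-v1", torch_dtype=torch.float16
).to("cuda")

# Text-to-image (t2i)
pipe.set_text_to_image_mode()
image = pipe(prompt="an elephant under the sea", num_inference_steps=20).images[0]
image.save("t2i_sample.png")

# Image-to-text (i2t): caption the generated image
pipe.set_image_to_text_mode()
caption = pipe(image=image, num_inference_steps=20).text[0]
print(caption)

# Joint generation: sample an image and a matching caption together
pipe.set_joint_mode()
sample = pipe(num_inference_steps=20)
sample.images[0].save("joint_sample.png")
print(sample.text[0])
```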

Highlighted Details

  • Achieves comparable or superior quantitative results (FID, CLIP score) to specialized models in tasks like text-to-image generation.
  • Supports seven generation modes: text-to-image, image-to-text, joint image-text generation, image-only, text-only, image variation, and text variation.
  • Utilizes a U-ViT backbone, a Stable Diffusion autoencoder, and CLIP encoders.
  • Offers two U-ViT checkpoints: v0, trained on LAION-5B, and v1, additionally trained on internal datasets.

Maintenance & Community

  • The project is associated with the authors of the referenced papers.
  • Integration with the Hugging Face diffusers library suggests active community support and adoption.

Licensing & Compatibility

  • The repository does not explicitly state a license in the README. The underlying components (e.g., Stable Diffusion autoencoder, CLIP) have their own licenses. Users should verify compatibility for commercial or closed-source use.

Limitations & Caveats

The README does not list explicit limitations or known bugs. However, the dependency on specific versions of libraries such as torch and accelerate may pose compatibility challenges with newer releases, and as research code accompanying the papers, the project may be subject to ongoing development and breaking changes.

Health Check

  • Last commit: 2 years ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 17 stars in the last 90 days

Explore Similar Projects

Starred by Dan Abramov (Core Contributor to React), Patrick von Platen (Core Contributor to Hugging Face Transformers and Diffusers), and 28 more.

stable-diffusion by CompVis

Latent text-to-image diffusion model

created 3 years ago
updated 1 year ago
71k stars

Top 0.1% on sourcepulse