unidiffuser by thu-ml

Unified diffusion framework for multi-modal generation

created 2 years ago
1,430 stars

Top 29.1% on sourcepulse

View on GitHub
Project Summary

UniDiffuser is a unified diffusion framework designed to handle multiple data modalities (image, text) within a single model. It addresses the challenge of training separate diffusion models for marginal, conditional, and joint distributions by unifying them as a single noise prediction task. This approach benefits researchers and practitioners working with multi-modal generative AI who seek a versatile and efficient solution.

How It Works

UniDiffuser employs a Transformer-based architecture (U-ViT) to parameterize the diffusion model. The core innovation lies in perturbing data across all modalities simultaneously and inputting modality-specific timesteps. The model then predicts the noise for all perturbed modalities. This unified approach, leveraging a shared Transformer backbone, allows for efficient simultaneous learning of image, text, text-to-image, image-to-text, and joint image-text generation without requiring separate models or significant architectural modifications.
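
Below is a minimal PyTorch sketch of that unified objective, not the project's actual U-ViT code: the joint noise-prediction network interface (`noise_pred_net`) and the tensor shapes are illustrative assumptions.

```python
import torch

def unidiffuser_training_step(noise_pred_net, x_img, x_txt, alphas_cumprod):
    """Illustrative sketch of the unified objective: perturb both modalities
    with independent timesteps and predict the noise of both at once."""
    B = x_img.shape[0]

    # Modality-specific timesteps; at sampling time, fixing one modality's
    # timestep to 0 (keeping it clean) recovers conditional generation.
    t_img = torch.randint(0, len(alphas_cumprod), (B,))
    t_txt = torch.randint(0, len(alphas_cumprod), (B,))

    # Perturb the image latent and the text embedding simultaneously.
    eps_img, eps_txt = torch.randn_like(x_img), torch.randn_like(x_txt)
    a_img = alphas_cumprod[t_img].view(B, 1, 1, 1)   # x_img: (B, C, H, W)
    a_txt = alphas_cumprod[t_txt].view(B, 1)         # x_txt: (B, D)
    z_img = a_img.sqrt() * x_img + (1 - a_img).sqrt() * eps_img
    z_txt = a_txt.sqrt() * x_txt + (1 - a_txt).sqrt() * eps_txt

    # One shared transformer predicts the noise for both modalities,
    # conditioned on both timesteps (assumed interface).
    pred_img, pred_txt = noise_pred_net(z_img, z_txt, t_img, t_txt)

    # Single unified noise-prediction loss over all modalities.
    return ((pred_img - eps_img) ** 2).mean() + ((pred_txt - eps_txt) ** 2).mean()
```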

Quick Start & Requirements

  • Install: Use conda to create an environment and pip to install dependencies. Key packages include torch, accelerate, transformers, clip, and optionally xformers and triton for performance.
  • Prerequisites: Python 3.9, CUDA 11.6 (for PyTorch), and a GPU with at least 10GB VRAM are recommended.
  • Pretrained Models: Download autoencoder_kl.pth, caption_decoder.pth, and uvit_v0.pth or uvit_v1.pth from Hugging Face and place them in a models directory (see the download sketch after this list).
  • Inference: Run sample_multi_v1.py (or sample_multi_v0.py) with specified modes like t2i, i2t, joint, etc.
  • Diffusers Integration: Available via UniDiffuserPipeline in the diffusers library (see the pipeline sketch after this list).
  • Documentation: Official UniDiffuser documentation
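
A hedged sketch of fetching the checkpoints listed above with huggingface_hub; the repo id thu-ml/unidiffuser-v1 and the exact filenames are assumptions to verify against the project README.

```python
from pathlib import Path
from huggingface_hub import hf_hub_download

models_dir = Path("models")
models_dir.mkdir(exist_ok=True)

# Checkpoint names as listed above; the hosting repo id is an assumption.
for filename in ["autoencoder_kl.pth", "caption_decoder.pth", "uvit_v1.pth"]:
    hf_hub_download(
        repo_id="thu-ml/unidiffuser-v1",  # assumed Hugging Face repo
        filename=filename,
        local_dir=models_dir,
    )
```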
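And a sketch of the diffusers integration via UniDiffuserPipeline; argument names and defaults may differ between diffusers versions, so treat this as a starting point rather than canonical usage.

```python
import torch
from diffusers import UniDiffuserPipeline

pipe = UniDiffuserPipeline.from_pretrained(
    "thu-ml/unidiffuser-v1", torch_dtype=torch.float16
).to("cuda")

# Text-to-image (t2i)
pipe.set_text_to_image_mode()
image = pipe(prompt="an elephant under the sea", num_inference_steps=20).images[0]
image.save("t2i_sample.png")

# Image-to-text (i2t): caption the generated image
pipe.set_image_to_text_mode()
caption = pipe(image=image, num_inference_steps=20).text[0]
print(caption)

# Joint generation: sample an image and a matching caption together
pipe.set_joint_mode()
sample = pipe(num_inference_steps=20)
sample.images[0].save("joint_sample.png")
print(sample.text[0])
```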

Highlighted Details

  • Achieves comparable or superior quantitative results (FID, CLIP score) to specialized models in tasks like text-to-image generation.
  • Supports seven generation modes: text-to-image, image-to-text, joint image-text generation, image-only, text-only, image variation, and text variation.
  • Utilizes a U-ViT backbone, a Stable Diffusion autoencoder, and CLIP encoders.
  • Offers two U-ViT checkpoints: v0, trained on LAION-5B, and v1, additionally trained on internal datasets.

Maintenance & Community

  • The project is associated with the authors of the referenced papers.
  • Integration with the Hugging Face diffusers library suggests active community support and adoption.

Licensing & Compatibility

  • The repository does not explicitly state a license in the README. The underlying components (e.g., Stable Diffusion autoencoder, CLIP) have their own licenses. Users should verify compatibility for commercial or closed-source use.

Limitations & Caveats

The README does not list explicit limitations or known bugs. However, the dependency on specific versions of libraries such as torch and accelerate may pose compatibility challenges with newer releases, and as research code accompanying the papers, the project may be subject to ongoing development and breaking changes.

Health Check

  • Last commit: 2 years ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 17 stars in the last 90 days

Explore Similar Projects

Starred by Dan Abramov (Core Contributor to React), Patrick von Platen (Core Contributor to Hugging Face Transformers and Diffusers), and 28 more.

stable-diffusion by CompVis

Latent text-to-image diffusion model

created 3 years ago
updated 1 year ago
71k stars

Top 0.1% on sourcepulse