MDT by sail-sg

Image synthesis research paper (ICCV 2023)

Created 2 years ago
580 stars

Project Summary

This repository provides the official implementation of Masked Diffusion Transformer (MDT) and its improved version, MDTv2, which reports a state-of-the-art FID of 1.58 on ImageNet 256x256. It targets researchers and practitioners in computer vision and generative modeling who want to improve image synthesis quality and training efficiency.

How It Works

MDT addresses the limited contextual reasoning of diffusion models with a mask latent modeling scheme. It operates in the latent space: during training, a subset of latent tokens is masked, and an asymmetric diffusion transformer predicts the masked tokens from the unmasked ones. This strengthens the model's ability to learn relationships among the semantic parts of an object, so it can reconstruct a full image from incomplete context. MDTv2 refines the approach with a more efficient macro network structure and training strategy, which yields faster convergence and stronger performance.
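
The masking step described above can be illustrated with a small, framework-free sketch. It is not the repository's code: the helper names (split_tokens, gather), the 16x16 token grid, and the 0.3 mask ratio are assumptions chosen for illustration, and string placeholders stand in for latent token embeddings.

```python
import random


def split_tokens(num_tokens, mask_ratio, seed=None):
    """Randomly partition token positions into kept and masked sets."""
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")
    rng = random.Random(seed)
    num_masked = int(num_tokens * mask_ratio)
    order = list(range(num_tokens))
    rng.shuffle(order)
    masked = sorted(order[:num_masked])
    kept = sorted(order[num_masked:])
    return kept, masked


def gather(tokens, positions):
    """Select the tokens at the given positions, preserving order."""
    return [tokens[i] for i in positions]


# Toy latent: a hypothetical 16x16 grid of patch tokens (256 placeholders).
tokens = [f"z{i}" for i in range(256)]
kept, masked = split_tokens(len(tokens), mask_ratio=0.3, seed=0)

visible = gather(tokens, kept)    # the only tokens an encoder would see
targets = gather(tokens, masked)  # tokens the model must predict from context
print(len(visible), len(targets))
```

In this sketch the encoder input is shorter than the full sequence, which is one reason a masked scheme can cut training cost; how MDT restores the masked positions before decoding is left to the repository and paper.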

Quick Start & Requirements

  • Install: pip install -e . and pip install git+https://github.com/sail-sg/Adan.git
  • Prerequisites: PyTorch >= 2.0, Adan optimizer. Requires ImageNet dataset for training/evaluation.
  • Pretrained Models: Available on Hugging Face (shgao/MDT-XL2).
  • Demo: https://huggingface.co/spaces/shgao/MDT
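
Before training or evaluation, the prerequisites above can be checked with a short script. This is a minimal sketch using only the Python standard library; the function names are hypothetical, and the assumption that the Adan package is importable as a module named adan should be confirmed against the Adan repository.

```python
import re
from importlib import metadata, util


def meets_minimum(version, minimum):
    """Compare the leading numeric components of a version string to a tuple."""
    core = version.split("+")[0]  # drop local build tags such as "+cu118"
    nums = []
    for piece in core.split(".")[: len(minimum)]:
        match = re.match(r"\d+", piece)
        nums.append(int(match.group()) if match else 0)
    nums += [0] * (len(minimum) - len(nums))  # pad short versions like "2"
    return tuple(nums) >= tuple(minimum)


def check_prerequisites():
    """Report whether PyTorch >= 2.0 and the Adan optimizer appear installed."""
    report = {}
    try:
        report["torch>=2.0"] = meets_minimum(metadata.version("torch"), (2, 0))
    except metadata.PackageNotFoundError:
        report["torch>=2.0"] = False
    # Assumption: the Adan package exposes a top-level module named "adan".
    report["adan importable"] = util.find_spec("adan") is not None
    return report


print(check_prerequisites())
```

The check only inspects installed package metadata, so it runs without importing PyTorch itself.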

Highlighted Details

  • Achieves SOTA FID score of 1.58 on ImageNet 256x256 with MDTv2-XL/2.
  • MDTv2 learns more than 10x faster than DiT, the previous SOTA.
  • MDTv2 demonstrates a 5x acceleration over the original MDT.
  • Codebase built upon DiT and ADM.

Maintenance & Community

  • Contributors: Shanghua Gao, Pan Zhou, Ming-Ming Cheng, Shuicheng Yan.
  • Acknowledgements: DiT and ADM projects.

Licensing & Compatibility

  • License: Not stated in the README. The project is presented as an official codebase, but no license terms are given, so commercial use and closed-source linking require clarification from the authors.

Limitations & Caveats

  • The README does not specify a license, creating uncertainty for commercial use or integration into closed-source projects.
  • Evaluation requires following the separate instructions in the evaluations folder, which suggests a multi-step setup rather than a single command.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 3 stars in the last 30 days
