DiffiT by NVlabs

Diffusion Vision Transformers for high-fidelity image generation

Created 3 years ago
518 stars

Top 60.4% on SourcePulse

Summary

DiffiT addresses high-fidelity image generation by combining Diffusion Models with Vision Transformers (ViTs). It introduces Time-dependent Multihead Self Attention (TMSA) for precise control over the denoising process across timesteps. This approach targets researchers and engineers in generative AI, offering state-of-the-art performance on class-conditional image synthesis tasks.

How It Works

The core innovation lies in integrating ViTs with diffusion models through TMSA. Rather than conditioning the network only through normalization layers, TMSA computes attention from both the spatial tokens and a time-token embedding, so the attention pattern itself adapts at each denoising timestep. This architectural choice gives DiffiT finer-grained control over the denoising pipeline and better generation quality than prior methods.
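The mechanism can be illustrated with a minimal single-head sketch (NumPy, not the repository's code; the function name, weight names, and shapes are illustrative assumptions). The key idea is that queries, keys, and values are linear functions of both the spatial tokens and a time token, so attention varies across denoising timesteps:

```python
import numpy as np

def time_dependent_attention(x, t_emb, Wq, Wk, Wv, Wtq, Wtk, Wtv):
    """Single-head sketch of time-dependent self-attention (TMSA-style).

    x:     (n_tokens, d) spatial tokens
    t_emb: (1, d) time-token embedding for the current denoising step
    """
    # Queries, keys, and values depend on both spatial and time tokens,
    # so the attention pattern can shift as denoising progresses.
    q = x @ Wq + t_emb @ Wtq
    k = x @ Wk + t_emb @ Wtk
    v = x @ Wv + t_emb @ Wtv
    d = q.shape[-1]
    scores = (q @ k.T) / np.sqrt(d)
    # Numerically stable softmax over the key dimension.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

In the actual model this runs per head inside each transformer block; the sketch only conveys how the time token enters the attention computation.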

Quick Start & Requirements

The repository provides official PyTorch code and pretrained model checkpoints for DiffiT. Image sampling is initiated via sample.py, with example commands for ImageNet-256 and ImageNet-512 resolutions, requiring configuration of log directories and model paths. Evaluation of generated images, including FID scores, is handled by eval_run.sh, mirroring the openai/guided-diffusion evaluation protocol. Ready-to-use Slurm scripts are also available for batch processing.
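A hedged sketch of the workflow described above (the flag names below are illustrative assumptions, not the repository's documented CLI; consult the README for the real arguments):

```shell
# Sampling (illustrative flags; check `python sample.py --help` for the real ones)
python sample.py \
    --config imagenet_256 \
    --ckpt /path/to/pretrained_checkpoint.pt \
    --logdir ./samples

# FID evaluation, mirroring the openai/guided-diffusion protocol
bash eval_run.sh
```

The Slurm scripts mentioned in the README wrap invocations like these for batch jobs on a cluster.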

Highlighted Details

  • Achieves state-of-the-art (SOTA) performance on class-conditional ImageNet generation.
  • Reports an FID score of 1.73 for ImageNet-256 and 2.67 for ImageNet-512.
  • Generates images at resolutions up to 512x512.
  • Achieves an Inception Score of 276.49 on ImageNet-256.

Maintenance & Community

The project is an official release from NVIDIA Research, with code and pretrained models made available on March 8, 2026. It was accepted to ECCV 2024. No community channels (e.g., Discord, Slack) or roadmap details are provided in the README.

Licensing & Compatibility

The source code is released under the NVIDIA Source Code License-NC, which restricts commercial use. Pre-trained models are shared under the CC-BY-NC-SA-4.0 license, requiring any derivative works to be distributed under the same non-commercial, share-alike terms.

Limitations & Caveats

The primary limitation is the non-commercial (NC) nature of both the source code and pre-trained model licenses, preventing integration into commercial products or services. Derivative works must adhere to the same restrictive licensing.

Health Check

  • Last Commit: 1 month ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 5 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), Zhiqiang Xie (coauthor of SGLang), and 1 more.

Sana by NVlabs

Top 0.3% · 5k stars
Image synthesis research paper using a linear diffusion transformer
Created 1 year ago · Updated 1 day ago