DiffiT by NVlabs

Diffusion Vision Transformers for high-fidelity image generation

Created 3 years ago
518 stars

Top 60.4% on SourcePulse

Summary

DiffiT addresses high-fidelity image generation by combining Diffusion Models with Vision Transformers (ViTs). It introduces Time-dependent Multihead Self Attention (TMSA) for precise control over the denoising process across timesteps. This approach targets researchers and engineers in generative AI, offering state-of-the-art performance on class-conditional image synthesis tasks.

How It Works

The core innovation lies in integrating ViTs with diffusion models through TMSA. Rather than conditioning the network only through normalization layers, TMSA computes attention from both the spatial tokens and a time-token embedding, so the attention pattern itself adapts at each denoising timestep. This architectural choice gives DiffiT finer-grained control over the denoising pipeline and better generation quality than prior methods.
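The mechanism can be illustrated with a minimal single-head sketch (NumPy, not the repository's code; the function name, weight names, and shapes are illustrative assumptions). The key idea is that queries, keys, and values are linear functions of both the spatial tokens and a time token, so attention varies across denoising timesteps:

```python
import numpy as np

def time_dependent_attention(x, t_emb, Wq, Wk, Wv, Wtq, Wtk, Wtv):
    """Single-head sketch of time-dependent self-attention (TMSA-style).

    x:     (n_tokens, d) spatial tokens
    t_emb: (1, d) time-token embedding for the current denoising step
    """
    # Queries, keys, and values depend on both spatial and time tokens,
    # so the attention pattern can shift as denoising progresses.
    q = x @ Wq + t_emb @ Wtq
    k = x @ Wk + t_emb @ Wtk
    v = x @ Wv + t_emb @ Wtv
    d = q.shape[-1]
    scores = (q @ k.T) / np.sqrt(d)
    # Numerically stable softmax over the key dimension.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

In the actual model this runs per head inside each transformer block; the sketch only conveys how the time token enters the attention computation.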

Quick Start & Requirements

The repository provides official PyTorch code and pretrained model checkpoints for DiffiT. Image sampling is initiated via sample.py, with example commands for ImageNet-256 and ImageNet-512 resolutions, requiring configuration of log directories and model paths. Evaluation of generated images, including FID scores, is handled by eval_run.sh, mirroring the openai/guided-diffusion evaluation protocol. Ready-to-use Slurm scripts are also available for batch processing.
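A hedged sketch of the workflow described above (the flag names below are illustrative assumptions, not the repository's documented CLI; consult the README for the real arguments):

```shell
# Sampling (illustrative flags; check `python sample.py --help` for the real ones)
python sample.py \
    --config imagenet_256 \
    --ckpt /path/to/pretrained_checkpoint.pt \
    --logdir ./samples

# FID evaluation, mirroring the openai/guided-diffusion protocol
bash eval_run.sh
```

The Slurm scripts mentioned in the README wrap invocations like these for batch jobs on a cluster.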

Highlighted Details

  • Achieves state-of-the-art (SOTA) performance on class-conditional ImageNet generation.
  • Reports an FID score of 1.73 for ImageNet-256 and 2.67 for ImageNet-512.
  • Generates images at resolutions up to 512x512.
  • Achieves an Inception Score of 276.49 on ImageNet-256.

Maintenance & Community

The project is an official release from NVIDIA Research, with code and pretrained models made available on March 8, 2026. It was accepted to ECCV 2024. No community channels (e.g., Discord, Slack) or roadmap details are provided in the README.

Licensing & Compatibility

The source code is released under the NVIDIA Source Code License-NC, which restricts commercial use. Pre-trained models are shared under the CC-BY-NC-SA-4.0 license, requiring any derivative works to be distributed under the same non-commercial, share-alike terms.

Limitations & Caveats

The primary limitation is the non-commercial (NC) nature of both the source code and pre-trained model licenses, preventing integration into commercial products or services. Derivative works must adhere to the same restrictive licensing.

Health Check

  • Last Commit: 1 month ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 5 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), Zhiqiang Xie (coauthor of SGLang), and 1 more.

Sana by NVlabs

Top 0.3% · 5k stars
Image synthesis research paper using a linear diffusion transformer
Created 1 year ago · Updated 1 day ago