PiD by nv-tlabs

Pixel diffusion decoder for high-resolution latent-to-image generation

Created 3 weeks ago

717 stars

Top 47.4% on SourcePulse

Project Summary

PiD is a diffusion decoder designed to replace the VAE/RAE decoder in latent-based generative models. It recasts latent-to-image decoding as conditional diffusion in pixel space: the model denoises directly at high resolution, so decoding and upsampling happen in a single, fast generative pass. It is aimed at researchers and developers working with large-scale image generation models.

How It Works

PiD treats latent-to-pixel decoding as a conditional pixel-space diffusion problem. Rather than decoding and then upsampling in separate stages, it denoises directly in high-resolution pixel space and produces a super-resolved image in one generative pass. This unified design is intended as a more efficient, potentially higher-quality alternative to conventional VAE/RAE decoders.
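The idea of "denoise at the target resolution, conditioned on the latent" can be illustrated with a toy sampler. The sketch below is not PiD's implementation: the names decode, toy_denoiser and upsample_nearest are hypothetical, the "network" is a stand-in that simply returns the upsampled latent, and the linear noise interpolation x_t = (1 - t) * x0 + t * noise is an assumption, since the summary does not say which diffusion parameterization PiD uses.

```python
import random


def upsample_nearest(latent, factor):
    """Repeat each latent value `factor` times (toy conditioning signal)."""
    return [value for value in latent for _ in range(factor)]


def toy_denoiser(x_t, t, cond):
    """Hypothetical stand-in for a trained network that predicts the clean signal.

    A real decoder would be a neural network reading (x_t, t, cond);
    here the "clean image" is simply the conditioning itself.
    """
    return list(cond)


def decode(latent, factor=4, steps=8, seed=0):
    """Sample a high-resolution signal in one loop, conditioned on a low-resolution latent."""
    rng = random.Random(seed)
    cond = upsample_nearest(latent, factor)
    # Start from pure Gaussian noise at the *target* resolution.
    x = [rng.gauss(0.0, 1.0) for _ in cond]
    for i in range(steps):
        t = 1.0 - i / steps             # current noise level, from 1 down to 1/steps
        t_next = 1.0 - (i + 1) / steps  # next noise level, reaches 0 on the last step
        x0_hat = toy_denoiser(x, t, cond)
        # Assumed model: x_t = (1 - t) * x0 + t * noise, so moving from t to t_next
        # shrinks the distance to the predicted clean signal by t_next / t.
        x = [x0 + (t_next / t) * (xi - x0) for xi, x0 in zip(x, x0_hat)]
    return x


print(decode([0.5, -1.0], factor=2))  # [0.5, 0.5, -1.0, -1.0]
```

Because the loop runs at the output resolution from the first step, there is no separate upsampling stage: the conditioning carries the latent, and the sampler produces the full-resolution result directly.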

Quick Start & Requirements

  • Primary Install: install the utility dependencies, then run pip install -e . from the repository root; alternatively, run conda env create -f environment.yml to build the full environment.
  • Prerequisites: PyTorch (with CUDA), transformers>=4.57.x, diffusers>=0.37. Additional dependencies include hydra-core, omegaconf, pyyaml, attrs, einops, loguru, termcolor, fvcore, iopath, wandb, imageio, opencv-python-headless, pandas, safetensors, sentencepiece, boto3, botocore. DINOv2/SigLIP backbones require optional dependencies detailed in docs/dinov2_siglip.md.
  • Resource Footprint: Inference examples demonstrate single-GPU usage.
  • Links: Paper, Project Page, Model Weights, Checkpoints Docs.
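Before running inference, it can help to confirm that the installed transformers and diffusers versions meet the minimums listed above. The following is a minimal sketch using only the Python standard library; the helper names version_tuple and check are hypothetical, and reading "4.57.x" as "at least 4.57" is an assumption.

```python
import re
from importlib.metadata import PackageNotFoundError, version

# Minimums taken from the prerequisites above; "4.57.x" is read as >= 4.57 (assumption).
MINIMUMS = {"transformers": (4, 57), "diffusers": (0, 37)}


def version_tuple(text):
    """Return the leading numeric components, e.g. '0.37.0.dev0' -> (0, 37, 0)."""
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if not match:
            break  # stop at the first non-numeric component such as 'dev0'
        parts.append(int(match.group()))
    return tuple(parts)


def check(minimums=MINIMUMS):
    """Return {package: (installed_version_or_None, meets_minimum)}."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            report[name] = (None, False)
            continue
        report[name] = (installed, version_tuple(installed) >= minimum)
    return report


if __name__ == "__main__":
    for name, (installed, ok) in check().items():
        status = "ok" if ok else "needs attention"
        print(f"{name}: {installed or 'not installed'} -> {status}")
```

This only compares version numbers; it does not verify that PyTorch was built with CUDA support or that the optional DINOv2/SigLIP dependencies are present.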

Highlighted Details

  • Plug-and-play diffusion decoder designed to replace VAE/RAE decoders.
  • Unifies decoding and upsampling into a single generative module.
  • Supports multiple latent diffusion model backbones including FLUX, FLUX.2, SD3, Z-Image, Z-Image-Turbo, DINOv2, and SigLIP.
  • Offers two decoder variants: 2k (2048px trained) and 2kto4k (up to 4K resolution trained).
  • Provides two inference entry points: from_clean_* (image -> encode -> PiD) and from_ldm_* (text/class -> LDM -> PiD).
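The two entry-point families differ only in where the latent comes from; the decoder stage is shared. A rough sketch of that data flow follows. Everything in it is hypothetical: run_clean_path, run_ldm_path and the toy_* stand-ins are illustrative names, not functions from the PiD repository, and the toy arithmetic only mimics the shape of each stage.

```python
from typing import Callable, List

Latent = List[float]
Image = List[float]


def run_clean_path(image: Image,
                   encode: Callable[[Image], Latent],
                   decode: Callable[[Latent], Image]) -> Image:
    """image -> encode -> decoder: reconstruct an existing image."""
    return decode(encode(image))


def run_ldm_path(prompt: str,
                 sample_latent: Callable[[str], Latent],
                 decode: Callable[[Latent], Image]) -> Image:
    """text/class -> LDM -> decoder: generate a new image."""
    return decode(sample_latent(prompt))


# Toy stand-ins so the sketch runs end to end (all hypothetical).
def toy_encode(image: Image) -> Latent:
    """Average neighbouring pairs: a 2x 'compression'."""
    return [sum(image[i:i + 2]) / 2 for i in range(0, len(image), 2)]


def toy_sample(prompt: str) -> Latent:
    """Pretend latent sampler: one value per word."""
    return [float(len(word)) for word in prompt.split()]


def toy_decode(latent: Latent) -> Image:
    """Pretend decoder: repeat each latent value twice."""
    return [value for value in latent for _ in range(2)]


print(run_clean_path([1.0, 3.0, 5.0, 7.0], toy_encode, toy_decode))  # [2.0, 2.0, 6.0, 6.0]
print(run_ldm_path("a cat", toy_sample, toy_decode))                 # [1.0, 1.0, 3.0, 3.0]
```

The clean-image path is the natural choice for measuring reconstruction quality, since the input image serves as ground truth, while the LDM path exercises the full generation pipeline.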

Maintenance & Community

The paper, code, and model weights were released on May 25, 2026. Planned additions include PiD options for Qwen-Image and SD-XL, undistilled checkpoints, and training scripts. The README mentions no community channels (e.g., Discord, Slack) and no notable sponsorships.

Licensing & Compatibility

The PiD codebase is licensed under the Apache License 2.0, a permissive license that allows commercial use and integration into closed-source projects, provided the license text and attribution notices are retained.

Limitations & Caveats

The 2kto4k decoder variant is noted to perform worse than the 2k variant at 2048px resolution. Training scripts are planned but not yet released. The DINOv2 and SigLIP backbones need additional setup because their Latent Diffusion Models (LDMs) do not integrate directly with the Hugging Face diffusers library.
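The first caveat suggests a simple selection rule: use the 2k variant for targets up to 2048px and reserve 2kto4k for larger outputs. The helper below is a hypothetical sketch of that rule, not part of the PiD codebase; the function name and the hard 2048px threshold are assumptions, and no upper bound is enforced because the summary does not give the exact pixel size meant by "4K".

```python
def pick_variant(target_long_side_px: int) -> str:
    """Pick a decoder variant name from the target resolution (heuristic)."""
    if target_long_side_px <= 0:
        raise ValueError("target resolution must be positive")
    # The 2kto4k variant is reported to be worse at 2048px, so prefer 2k there.
    return "2k" if target_long_side_px <= 2048 else "2kto4k"


print(pick_variant(2048))  # 2k
print(pick_variant(3072))  # 2kto4k
```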

Health Check

  • Last Commit: 1 week ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 10
  • Star History: 718 stars in the last 22 days
