PiD by nv-tlabs

Pixel diffusion decoder for high-resolution latent-to-image generation

Created 2 months ago

969 stars

Top 37.4% on SourcePulse

Project Summary

PiD is a diffusion decoder designed to replace the VAE/RAE decoder in latent-based generative models. It recasts the decoding of latent representations into high-resolution images as a conditional pixel-space diffusion problem: the model denoises directly in high-resolution pixel space, so decoding and upsampling happen in a single, fast generative pass rather than in separate stages. It is aimed at researchers and developers working with large-scale image generation models.

How It Works

PiD treats latent-to-pixel decoding as conditional pixel-space diffusion. Instead of running separate decoding and upsampling stages, it denoises directly in high-resolution pixel space and produces a super-resolved image in a single generative pass. The intended result is a more efficient, and potentially higher-quality, alternative to traditional VAE/RAE decoders.
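The control flow described above can be illustrated with a toy NumPy sketch. Nothing here is PiD's actual code or API: the function names, the nearest-neighbour conditioning, the linear noise schedule, and the stand-in denoiser (which simply returns the conditioning signal) are all assumptions chosen so the example runs without a trained model. It only shows the shape of the idea: start from noise at the target resolution and denoise toward an image while conditioning on the latent.

```python
import numpy as np


def upsample_nearest(latent, scale):
    # Repeat every latent cell `scale` times along height and width.
    return latent.repeat(scale, axis=0).repeat(scale, axis=1)


def toy_denoiser(x_t, t, cond):
    # Hypothetical stand-in for a trained network: it ignores x_t and t
    # and returns the conditioning signal as its clean-image estimate.
    return cond


def decode(latent, scale=8, num_steps=4, seed=0):
    rng = np.random.default_rng(seed)
    cond = upsample_nearest(latent, scale)   # conditioning at target size
    x = rng.standard_normal(cond.shape)      # pure noise in pixel space
    ts = np.linspace(1.0, 0.0, num_steps + 1)
    for t, t_next in zip(ts[:-1], ts[1:]):
        x0_hat = toy_denoiser(x, t, cond)
        # Assumed schedule: x_t = (1 - t) * x0 + t * noise.
        # Recover the implied noise, then re-mix at the next noise level.
        noise_hat = (x - (1.0 - t) * x0_hat) / t
        x = (1.0 - t_next) * x0_hat + t_next * noise_hat
    return x


if __name__ == "__main__":
    latent = np.arange(16, dtype=float).reshape(4, 4)
    image = decode(latent)
    print(image.shape)
```

Because the stand-in denoiser is trivial, the output is just the upsampled latent; in a real decoder the network would supply the high-frequency detail that upsampling alone cannot.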

Quick Start & Requirements

Primary Install: pip install -e . after installing utility dependencies, or use conda env create -f environment.yml for a full environment.
Prerequisites: PyTorch (with CUDA), transformers>=4.57.x, diffusers>=0.37. Additional dependencies include hydra-core, omegaconf, pyyaml, attrs, einops, loguru, termcolor, fvcore, iopath, wandb, imageio, opencv-python-headless, pandas, safetensors, sentencepiece, boto3, botocore. DINOv2/SigLIP backbones require optional dependencies detailed in docs/dinov2_siglip.md.
Resource Footprint: Inference examples demonstrate single-GPU usage.
Links: Paper, Project Page, Model Weights, Checkpoints Docs.
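Before running inference, it can help to confirm that the two version-pinned prerequisites are present. The following is a minimal sketch using only the standard library; the helper names are hypothetical, and it reads "4.57.x" as a minimum of 4.57, which is an assumption about the intended bound.

```python
from importlib import metadata

# Minimum versions taken from the prerequisites listed above.
# Assumption: "transformers>=4.57.x" is read as "at least 4.57".
MINIMUMS = {"transformers": (4, 57), "diffusers": (0, 37)}


def parse_version(text):
    # Keep only the leading numeric components,
    # e.g. "4.57.1" -> (4, 57, 1) and "0.37.0.dev0" -> (0, 37, 0).
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def check_prerequisites(minimums=MINIMUMS):
    # Map each package name to "ok", "missing", or "too old (<version>)".
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        if parse_version(installed) >= minimum:
            report[name] = "ok"
        else:
            report[name] = f"too old ({installed})"
    return report


if __name__ == "__main__":
    for package, status in check_prerequisites().items():
        print(f"{package}: {status}")
```

The check does not cover PyTorch's CUDA support or the optional DINOv2/SigLIP dependencies, which still need to be verified separately.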

Highlighted Details

Plug-and-play diffusion decoder designed to replace VAE/RAE decoders.
Unifies decoding and upsampling into a single generative module.
Supports multiple latent diffusion model backbones including FLUX, FLUX.2, SD3, Z-Image, Z-Image-Turbo, DINOv2, and SigLIP.
Offers two decoder variants: 2k (2048px trained) and 2kto4k (up to 4K resolution trained).
Provides two inference entry points: from_clean_* (image -> encode -> PiD) and from_ldm_* (text/class -> LDM -> PiD).
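Choosing between the two decoder variants can be reduced to a small rule of thumb. The helper below is hypothetical and not part of the PiD codebase. It assumes that "4K" means a 4096px long side and that the 2k variant is the better choice at or below 2048px, following the README's note that 2kto4k performs worse than 2k at 2048px.

```python
def choose_variant(target_px):
    """Return a decoder variant name for a target long-side size in pixels.

    Hypothetical helper: the thresholds are assumptions, not values
    taken from the PiD code.
    """
    if target_px <= 0:
        raise ValueError("target_px must be positive")
    if target_px <= 2048:
        # The 2kto4k variant is reported to do worse at 2048px.
        return "2k"
    if target_px <= 4096:
        # Assumption: "up to 4K" is read as up to 4096px.
        return "2kto4k"
    raise ValueError("no listed variant is trained above 4K")


if __name__ == "__main__":
    for size in (1024, 2048, 3072, 4096):
        print(size, choose_variant(size))
```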

Maintenance & Community

The paper, code, and model weights were released on May 25, 2026. Planned additions include PiD options for Qwen-Image and SD-XL, undistilled checkpoints, and training scripts. The README mentions no community channels (e.g., Discord, Slack) and no sponsorships.

Licensing & Compatibility

The PiD codebase is licensed under the Apache License 2.0. This permissive license allows commercial use and integration into closed-source projects, provided the license text and attribution notices are retained.

Limitations & Caveats

At 2048px resolution, the 2kto4k decoder variant is noted to perform worse than the 2k variant. Training scripts are planned but not yet released. The DINOv2 and SigLIP backbones require additional setup for their respective Latent Diffusion Models (LDMs), which do not integrate directly with the Hugging Face diffusers library.

Health Check

Last Commit

6 days ago

Responsiveness

Inactive

Star History

177 stars in the last 30 days