Sana by NVlabs

Image synthesis research paper using a linear diffusion transformer

Created 11 months ago
4,498 stars

Top 11.0% on SourcePulse

Project Summary

Sana is a text-to-image generation framework designed for efficient, high-resolution image synthesis. It targets researchers and content creators seeking fast, high-quality image generation with strong text-image alignment, even on consumer hardware. The core benefit is achieving state-of-the-art results with significantly reduced computational requirements and faster inference times compared to larger models.

How It Works

Sana combines three components. A 32x downsampling Deep Compression Autoencoder (DC-AE) reduces the number of latent tokens. A Linear Diffusion Transformer (Linear DiT) replaces standard attention with linear attention, which keeps compute manageable at high resolutions. A decoder-only LLM serves as the text encoder, paired with complex human instructions and in-context learning to improve image-text alignment. For faster sampling, Sana introduces Flow-DPM-Solver, which reduces the number of inference steps.
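The effect of the two efficiency ideas above can be sketched in a few lines of NumPy. This is an illustrative sketch, not Sana's actual implementation: the function names are hypothetical, the ReLU feature map is one common choice of linear-attention kernel, and the 8x autoencoder with patch size 2 is assumed here only as a conventional baseline for comparison.

```python
import numpy as np


def latent_tokens(image_size, downsample, patch=1):
    """Number of tokens the transformer sees for a square image."""
    side = image_size // (downsample * patch)
    return side * side


def relu_linear_attention(q, k, v, eps=1e-6):
    """Linear attention: cost grows linearly with the token count N.

    q, k have shape (N, d); v has shape (N, d_v).
    """
    q, k = np.maximum(q, 0), np.maximum(k, 0)  # ReLU feature map (assumed kernel)
    kv = k.T @ v                # (d, d_v) summary, independent of N
    k_sum = k.sum(axis=0)       # (d,) normalizer summary
    return (q @ kv) / (q @ k_sum + eps)[:, None]


def relu_quadratic_attention(q, k, v, eps=1e-6):
    """Same result, computed the O(N^2) way through the full N x N score matrix."""
    q, k = np.maximum(q, 0), np.maximum(k, 0)
    scores = q @ k.T            # (N, N)
    return (scores @ v) / (scores.sum(axis=1, keepdims=True) + eps)


# Token counts for a 1024 x 1024 image.
print(latent_tokens(1024, 32))           # 32x autoencoder, patch 1 -> 1024
print(latent_tokens(1024, 8, patch=2))   # assumed 8x baseline, patch 2 -> 4096
```

Both attention functions return the same values because matrix multiplication is associative; the linear form simply never materializes the N x N score matrix, which is what matters once N reaches the tens of thousands at 4K resolution.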

Quick Start & Requirements

  • Installation: Clone the repository and run ./environment_setup.sh sana or install components manually.
  • Prerequisites: Python >= 3.10.0, PyTorch >= 2.0.1+cu12.1.
  • Hardware: Inference needs 9GB VRAM for the 0.6B model and 12GB for the 1.6B model; quantized versions can run in under 8GB. Training requires 32GB VRAM.
  • Demos & Docs: Online demo at https://nv-sana.mit.edu/. diffusers integration through SanaPipeline and SanaPAGPipeline. ComfyUI nodes through ComfyUI_ExtraModels.
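For the diffusers route, a minimal inference sketch might look like the following. It assumes a CUDA GPU with enough VRAM for the chosen model, a diffusers release that ships SanaPipeline, and that the checkpoint ID shown is the one you want; the model ID, the step count, the guidance scale, and the helper name generate_image are assumptions for illustration, not values taken from this summary.

```python
def generate_image(
    prompt,
    model_id="Efficient-Large-Model/Sana_1600M_1024px_BF16_diffusers",  # assumed checkpoint ID
    out_path="sana.png",
    steps=20,             # assumed default, tune as needed
    guidance_scale=4.5,   # assumed default, tune as needed
    seed=42,
):
    """Generate one 1024 x 1024 image with SanaPipeline and save it to out_path."""
    # Heavy dependencies are imported lazily so this file can be imported
    # on a machine that does not have torch or diffusers installed.
    import torch
    from diffusers import SanaPipeline

    pipe = SanaPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16)
    pipe.to("cuda")

    # A seeded generator makes the output reproducible for a given prompt.
    generator = torch.Generator(device="cuda").manual_seed(seed)
    result = pipe(
        prompt=prompt,
        height=1024,
        width=1024,
        num_inference_steps=steps,
        guidance_scale=guidance_scale,
        generator=generator,
    )
    result.images[0].save(out_path)
    return out_path


if __name__ == "__main__":
    generate_image('a cyberpunk cat with a neon sign that says "Sana"')
```

The first call downloads the model weights, so expect it to take noticeably longer than later runs.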

Highlighted Details

  • Achieves 2K and 4K resolution image generation.
  • Supports ControlNet for guided generation.
  • Enables Dreambooth and LoRA fine-tuning.
  • Offers 8-bit and 4-bit quantization for reduced VRAM usage.
  • Claims a 20x smaller model and up to 100x higher throughput than large diffusion models (e.g., Flux-12B).
  • SANA-Sprint models achieve 1-4 step generation.

Maintenance & Community

The project is actively developed by NVlabs. Recent updates in March 2025 include the SANA-Sprint release and SANA-1.5 updates. Community support is evident in active diffusers and ComfyUI integrations.

Licensing & Compatibility

The codebase license was changed to Apache 2.0 on January 11, 2025. This license is permissive and generally compatible with commercial use and closed-source linking.

Limitations & Caveats

The README notes that performance metrics may differ depending on the GPU used. The project is under active development, and some features, such as video generation, are still listed under "TODO".

Health Check

  • Last commit: 5 days ago
  • Responsiveness: 1 day
  • Pull requests (30d): 3
  • Issues (30d): 4
  • Star history: 69 stars in the last 30 days
