Image synthesis research paper using a linear diffusion transformer
Top 11.3% on sourcepulse
Sana is a text-to-image generation framework designed for efficient, high-resolution image synthesis. It targets researchers and content creators seeking fast, high-quality image generation with strong text-image alignment, even on consumer hardware. The core benefit is achieving state-of-the-art results with significantly reduced computational requirements and faster inference times compared to larger models.
How It Works
Sana employs a novel architecture combining a 32x downsampling Deep Convolutional Autoencoder (DC-AE) to reduce latent token count, and a Linear Diffusion Transformer (Linear DiT) that replaces standard attention with linear attention for efficiency at high resolutions. It also utilizes a decoder-only LLM as a text encoder, enhanced with instruction tuning for improved image-text alignment. For faster sampling, it introduces Flow-DPM-Solver, reducing inference steps.
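The efficiency argument behind linear attention can be illustrated with a minimal sketch. This is an illustrative ReLU-kernel variant of linear attention, not Sana's exact implementation: instead of materializing the N x N attention matrix, the keys and values are summarized into a d x d matrix, so the cost grows linearly with the number of latent tokens rather than quadratically.

```python
import numpy as np

def linear_attention(Q, K, V, eps=1e-6):
    """Kernelized linear attention sketch (illustrative, not Sana's exact layer).

    Q, K, V: arrays of shape (N, d), where N is the token count.
    """
    # Feature map phi(x) = ReLU(x), a common choice in linear-attention variants.
    Q, K = np.maximum(Q, 0), np.maximum(K, 0)
    # Summarize keys/values into a (d, d) matrix instead of an (N, N) attention map.
    KV = K.T @ V            # (d, d), computed once in O(N * d^2)
    Z = K.sum(axis=0)       # (d,) normalizer
    # Each query reads from the fixed-size summary: O(N * d^2) overall.
    return (Q @ KV) / (Q @ Z[:, None] + eps)
```

Because the per-token work depends only on the feature dimension d, the savings grow with resolution, which is why linear attention pairs well with the 32x-downsampled latents at 1K+ outputs.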
Quick Start & Requirements
Run ./environment_setup.sh sana or install the components manually. A live demo is available at https://nv-sana.mit.edu/.
diffusers integration: SanaPipeline and SanaPAGPipeline. ComfyUI nodes: ComfyUI_ExtraModels.
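A minimal usage sketch of the diffusers integration, assuming a recent diffusers release that ships SanaPipeline; the checkpoint id and generation parameters below are illustrative, so check the Hugging Face Hub for current checkpoints:

```python
def generate(prompt: str):
    """Hedged sketch of text-to-image generation via diffusers' SanaPipeline."""
    import torch
    from diffusers import SanaPipeline

    pipe = SanaPipeline.from_pretrained(
        "Efficient-Large-Model/Sana_1600M_1024px_diffusers",  # assumed checkpoint id
        torch_dtype=torch.bfloat16,
    )
    pipe.to("cuda")  # assumes a CUDA-capable GPU
    result = pipe(
        prompt=prompt,
        height=1024,
        width=1024,
        guidance_scale=4.5,        # illustrative value
        num_inference_steps=20,    # illustrative value
    )
    return result.images[0]

# Example (requires GPU and downloads weights):
# generate("a cyberpunk cat with a neon sign").save("sana.png")
```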
Maintenance & Community
The project is actively developed by NVlabs, with updates as recent as March 2025 including the SANA-Sprint release and SANA-1.5 updates. Community support and integration are evident through active diffusers and ComfyUI contributions.
Licensing & Compatibility
The codebase license was changed to Apache 2.0 on January 11, 2025. This license is permissive and generally compatible with commercial use and closed-source linking.
Limitations & Caveats
While highly efficient, the README notes that performance metrics vary across GPU generations. The project is under active development, and some features, such as video generation, are still listed as TODO.