StableCascade by Stability-AI

Image generation model using cascaded diffusion

Created 1 year ago
6,587 stars

Top 7.8% on SourcePulse

Project Summary

Stable Cascade is the official codebase for a text-to-image diffusion model built on the Würstchen architecture, aimed at researchers and developers who want efficient, high-quality image generation. By operating in a highly compressed latent space (a compression factor of 42), it speeds up inference and lowers training cost. In human evaluations it is reported to outperform models such as Stable Diffusion XL in prompt alignment and aesthetic quality.

How It Works

Stable Cascade uses a three-stage cascade. Stage A (a VAE) and Stage B together compress images into a small 24x24 latent space, and Stage C (a diffusion model) generates those latents from text prompts. Because Stage C works on such small latents, inference is faster and training is cheaper than for models with larger latent spaces, while Stages A and B still reconstruct images at high fidelity.
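
To make the compression figures concrete, here is a rough arithmetic sketch. It assumes a 1024x1024 input image (the summary above does not state a resolution) and, for comparison, a hypothetical baseline model with an 8x per-side compression factor. Both values are illustrative assumptions.

```python
# Illustrative arithmetic only. IMAGE_SIDE and BASELINE_FACTOR are assumptions,
# not values taken from the summary above.
IMAGE_SIDE = 1024      # assumed input resolution, pixels per side
LATENT_SIDE = 24       # Stage C latent size per side, as stated in the text
BASELINE_FACTOR = 8    # assumed per-side compression of a conventional latent model

# Per-side compression factor of the cascade.
cascade_factor = IMAGE_SIDE / LATENT_SIDE

# Latent side length the assumed baseline would use for the same image.
baseline_latent_side = IMAGE_SIDE // BASELINE_FACTOR

# How many times more spatial positions the baseline latent has.
position_ratio = baseline_latent_side**2 / LATENT_SIDE**2

print(f"cascade compression per side: {cascade_factor:.1f}")   # 42.7
print(f"baseline latent side: {baseline_latent_side}")         # 128
print(f"baseline/cascade latent positions: {position_ratio:.1f}")  # 28.4
```

Under these assumptions the per-side factor comes out at about 42.7, in line with the quoted compression factor of 42, and the text-conditional stage handles roughly 28 times fewer latent positions than the assumed baseline would.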

Quick Start & Requirements

  • Install with two commands: pip install gradio accelerate, followed by pip install git+https://github.com/kashif/diffusers.git@wuerstchen-v3 (a branch of a diffusers fork).
  • Run the Gradio app with PYTHONPATH=./ python3 gradio_app/app.py.
  • Official documentation and usage examples are available in the 🤗 diffusers library.
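
As a minimal sketch of what usage through diffusers might look like, the function below chains a prior pipeline (Stage C) and a decoder pipeline (Stages B and A). It assumes the installed diffusers build exposes StableCascadePriorPipeline and StableCascadeDecoderPipeline as mainline diffusers does, that the weights are published under the model IDs shown, and that a CUDA GPU is available. The step counts, guidance scales, and dtypes are illustrative choices, not settings taken from this page.

```python
# Assumed Hugging Face model IDs; check the diffusers documentation before use.
PRIOR_ID = "stabilityai/stable-cascade-prior"   # Stage C (text -> latents)
DECODER_ID = "stabilityai/stable-cascade"       # Stages B and A (latents -> image)


def generate_image(prompt: str, negative_prompt: str = "", device: str = "cuda"):
    """Generate one PIL image: Stage C makes latents, Stages B/A decode them."""
    # Imported lazily so this file can be loaded without the heavy dependencies.
    import torch
    from diffusers import StableCascadeDecoderPipeline, StableCascadePriorPipeline

    prior = StableCascadePriorPipeline.from_pretrained(
        PRIOR_ID, torch_dtype=torch.bfloat16
    ).to(device)
    decoder = StableCascadeDecoderPipeline.from_pretrained(
        DECODER_ID, torch_dtype=torch.float16
    ).to(device)

    # Stage C: produce the small latent image embeddings from the prompt.
    prior_output = prior(
        prompt=prompt,
        negative_prompt=negative_prompt,
        height=1024,
        width=1024,
        guidance_scale=4.0,        # illustrative value
        num_inference_steps=20,    # illustrative value
    )

    # Stages B and A: decode the embeddings into a full-resolution image.
    result = decoder(
        image_embeddings=prior_output.image_embeddings.to(torch.float16),
        prompt=prompt,
        negative_prompt=negative_prompt,
        guidance_scale=0.0,        # illustrative value
        num_inference_steps=10,    # illustrative value
        output_type="pil",
    )
    return result.images[0]
```

Calling generate_image("an astronaut riding a horse").save("out.png") would download the weights on first use and write the result to disk. Loading both pipelines inside the function keeps the sketch short; a real application would likely load them once and reuse them.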

Highlighted Details

  • Achieves superior prompt alignment and aesthetic quality in human evaluations against models like SDXL and Playground v2.
  • Offers faster inference times despite a larger parameter count than SDXL.
  • Supports extensions like finetuning, LoRA, ControlNet, IP-Adapter, and LCM.
  • Provides training scripts for the model, ControlNet, and LoRA.
  • Includes a diffusion autoencoder (Stage A & B) for custom model training in a compressed space.

Maintenance & Community

The codebase is in early development, with potential for future updates and optimizations based on community interest. Feedback and contributions are welcomed.

Licensing & Compatibility

The code is released under the MIT License. Model weights are released under the Stability AI Non-Commercial Research Community License, which restricts commercial use.

Limitations & Caveats

The codebase is in early development and may contain errors or unoptimized code. The model weights are restricted to non-commercial and research community use.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 6 stars in the last 30 days
