StableCascade by Stability-AI

Image generation model using cascaded diffusion

Created 2 years ago
6,572 stars

Top 7.7% on SourcePulse

Project Summary

Stable Cascade is the official codebase for a text-to-image diffusion model built on the Würstchen architecture, aimed at researchers and developers who want efficient, high-quality image generation. By operating in a highly compressed latent space (a compression factor of about 42), it speeds up inference and reduces training cost, and it is reported to outperform models such as Stable Diffusion XL in prompt alignment and aesthetic quality.

How It Works

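Stable Cascade uses a three-stage cascade. Stage A (a VAE) and Stage B compress images into a small latent space (a 1024x1024 image, for example, is encoded to 24x24), and Stage C (a diffusion model) generates those latents from text prompts. At inference time the order is reversed: Stage C produces the latents first, and Stages B and A decode them into the full-resolution image. Because the latent space is much smaller than in comparable models, inference is faster and training is cheaper, while reconstructions remain high-fidelity.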
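The compression figures above can be checked with a little arithmetic. The sketch below assumes a 1024x1024 input image (the resolution quoted in the upstream README, not on this page) and, for comparison, a Stable Diffusion-style VAE with a compression factor of 8; the helper name latent_side is purely illustrative.

```python
def latent_side(image_side: int, compression: float) -> int:
    """Side length of the latent grid for a given spatial compression factor."""
    return round(image_side / compression)


IMAGE_SIDE = 1024        # assumed input resolution (pixels per side)
CASCADE_SIDE = 24        # Stage C latent side length stated in the text
SD_COMPRESSION = 8       # assumed factor for a Stable Diffusion-style VAE

# Per-side compression factor implied by 1024 -> 24.
cascade_factor = IMAGE_SIDE / CASCADE_SIDE

# Latent side length for the factor-8 baseline.
sd_side = latent_side(IMAGE_SIDE, SD_COMPRESSION)

# How many more spatial positions the baseline diffusion model must process.
positions_ratio = (sd_side ** 2) / (CASCADE_SIDE ** 2)

print(f"Stable Cascade compression factor: {cascade_factor:.2f}")
print(f"Baseline latent grid: {sd_side}x{sd_side}")
print(f"Baseline has {positions_ratio:.1f}x more latent positions")
```

Under these assumptions the text-conditional stage works on 576 spatial positions instead of 16,384, which is where the inference and training savings come from.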

Quick Start & Requirements

  • Install the dependencies with pip install gradio accelerate, then install the diffusers branch with pip install git+https://github.com/kashif/diffusers.git@wuerstchen-v3
  • Run the Gradio app with PYTHONPATH=./ python3 gradio_app/app.py.
  • Official documentation and usage examples are available in the 🤗 diffusers library.
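
For the diffusers route mentioned above, a minimal sketch might look like the following. It assumes a diffusers release that ships StableCascadePriorPipeline and StableCascadeDecoderPipeline, a CUDA GPU, and the Hugging Face model IDs stabilityai/stable-cascade-prior and stabilityai/stable-cascade; the model IDs, step counts, and guidance scales are taken from the public model card rather than from this page, and the function name generate_image is hypothetical.

```python
def generate_image(
    prompt: str,
    negative_prompt: str = "",
    prior_steps: int = 20,
    decoder_steps: int = 10,
    device: str = "cuda",
):
    """Run Stage C (prior), then Stages B and A (decoder); return a PIL image."""
    # Imported lazily so this module loads without the heavy dependencies.
    import torch
    from diffusers import StableCascadeDecoderPipeline, StableCascadePriorPipeline

    # Stage C: text prompt -> compressed image embeddings (assumed model ID).
    prior = StableCascadePriorPipeline.from_pretrained(
        "stabilityai/stable-cascade-prior", torch_dtype=torch.bfloat16
    ).to(device)

    # Stages B and A: embeddings -> full-resolution image (assumed model ID).
    decoder = StableCascadeDecoderPipeline.from_pretrained(
        "stabilityai/stable-cascade", torch_dtype=torch.float16
    ).to(device)

    prior_output = prior(
        prompt=prompt,
        negative_prompt=negative_prompt,
        height=1024,
        width=1024,
        guidance_scale=4.0,
        num_inference_steps=prior_steps,
    )

    result = decoder(
        # The decoder runs in float16, so cast the bfloat16 embeddings.
        image_embeddings=prior_output.image_embeddings.to(torch.float16),
        prompt=prompt,
        negative_prompt=negative_prompt,
        guidance_scale=0.0,
        num_inference_steps=decoder_steps,
        output_type="pil",
    )
    return result.images[0]
```

Calling generate_image("an astronaut riding a horse").save("out.png") would download both checkpoints on first use; the weights remain subject to the non-commercial license described below.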

Highlighted Details

  • Achieves superior prompt alignment and aesthetic quality in human evaluations against models like SDXL and Playground v2.
  • Offers faster inference times despite a larger parameter count than SDXL.
  • Supports extensions like finetuning, LoRA, ControlNet, IP-Adapter, and LCM.
  • Provides training scripts for the model, ControlNet, and LoRA.
  • Includes a diffusion autoencoder (Stage A & B) for custom model training in a compressed space.

Maintenance & Community

The codebase is in early development, with potential for future updates and optimizations based on community interest. Feedback and contributions are welcomed.

Licensing & Compatibility

The code is released under the MIT License. Model weights are released under the Stability AI Non-Commercial Research Community License, which restricts commercial use.

Limitations & Caveats

The codebase is in early development and may contain errors or unoptimized code. The model weights are restricted to non-commercial and research community use.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 4 stars in the last 30 days
