Multilingual text-to-image latent diffusion model
Kandinsky 2.2 is a multilingual text-to-image diffusion model offering enhanced aesthetic quality and text comprehension through a new CLIP-ViT-G image encoder and ControlNet support for guided image generation. It targets researchers and developers seeking advanced text-to-image capabilities with fine-grained control.
How It Works
Kandinsky 2.2 employs a latent diffusion architecture with a powerful CLIP-ViT-G image encoder (1.8B parameters) and a U-Net diffusion model (1.22B parameters). It utilizes a diffusion image prior to map text embeddings to image embeddings, enabling text-to-image and image-to-image generation. ControlNet integration allows for precise control over image generation using additional conditions like depth maps.
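For illustration, here is a minimal two-stage sketch of this prior-plus-decoder design using the Hugging Face diffusers Kandinsky 2.2 pipelines; the kandinsky-community model IDs and the parameter values are assumptions based on the publicly hosted checkpoints, not this repository's own API:

```python
import torch
from diffusers import KandinskyV22PriorPipeline, KandinskyV22Pipeline

# Stage 1: the diffusion image prior maps the text prompt to a CLIP-ViT-G image embedding.
prior = KandinskyV22PriorPipeline.from_pretrained(
    "kandinsky-community/kandinsky-2-2-prior", torch_dtype=torch.float16
).to("cuda")
image_embeds, negative_image_embeds = prior("red cat, 4k photo", guidance_scale=1.0).to_tuple()

# Stage 2: the U-Net decoder denoises latents conditioned on that image embedding.
decoder = KandinskyV22Pipeline.from_pretrained(
    "kandinsky-community/kandinsky-2-2-decoder", torch_dtype=torch.float16
).to("cuda")
image = decoder(
    image_embeds=image_embeds,
    negative_image_embeds=negative_image_embeds,
    height=768,
    width=768,
    num_inference_steps=50,
).images[0]
image.save("cat.png")
```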
Quick Start & Requirements
pip install "git+https://github.com/ai-forever/Kandinsky-2.git"
Example notebooks are provided in the ./notebooks folder.
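For depth-guided generation with ControlNet, a rough sketch using the diffusers KandinskyV22ControlnetPipeline; the checkpoint name is an assumption, and the depth map is stubbed with a random tensor where a real depth estimate would go:

```python
import torch
from diffusers import KandinskyV22PriorPipeline, KandinskyV22ControlnetPipeline

prior = KandinskyV22PriorPipeline.from_pretrained(
    "kandinsky-community/kandinsky-2-2-prior", torch_dtype=torch.float16
).to("cuda")
controlnet = KandinskyV22ControlnetPipeline.from_pretrained(
    "kandinsky-community/kandinsky-2-2-controlnet-depth", torch_dtype=torch.float16
).to("cuda")

image_embeds, negative_image_embeds = prior("a robot in a forest, 4k photo").to_tuple()

# `hint` stands in for a precomputed depth map: shape (1, 3, 768, 768), values scaled to [0, 1].
hint = torch.rand(1, 3, 768, 768, dtype=torch.float16, device="cuda")  # placeholder depth map

image = controlnet(
    image_embeds=image_embeds,
    negative_image_embeds=negative_image_embeds,
    hint=hint,
    height=768,
    width=768,
    num_inference_steps=50,
).images[0]
```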
Highlighted Details
Maintenance & Community
Last activity was about a year ago and the project is marked inactive.
Licensing & Compatibility
Limitations & Caveats
The 2.1 example code sets use_flash_attention=False, suggesting potential performance optimizations might be available or require specific configuration.
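For context, a sketch of where this flag is passed in the repository's 2.1 Python interface; the generation parameters below are illustrative:

```python
from kandinsky2 import get_kandinsky2

# Load Kandinsky 2.1 with flash attention disabled, as in the repository's example.
model = get_kandinsky2(
    "cuda",
    task_type="text2img",
    model_version="2.1",
    use_flash_attention=False,
)
images = model.generate_text2img(
    "red cat, 4k photo",
    num_steps=100,
    batch_size=1,
    guidance_scale=4,
    h=768,
    w=768,
    sampler="p_sampler",
    prior_cf_scale=4,
    prior_steps="5",
)
```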