Multilingual text-to-image latent diffusion model
Kandinsky 2.2 is a multilingual text-to-image diffusion model offering enhanced aesthetic quality and text comprehension through a new CLIP-ViT-G image encoder and ControlNet support for guided image generation. It targets researchers and developers seeking advanced text-to-image capabilities with fine-grained control.
How It Works
Kandinsky 2.2 employs a latent diffusion architecture with a powerful CLIP-ViT-G image encoder (1.8B parameters) and a U-Net diffusion model (1.22B parameters). It utilizes a diffusion image prior to map text embeddings to image embeddings, enabling text-to-image and image-to-image generation. ControlNet integration allows for precise control over image generation using additional conditions like depth maps.
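For illustration, here is a minimal two-stage sketch of this prior-plus-decoder design using the Hugging Face diffusers Kandinsky 2.2 pipelines; the kandinsky-community model IDs and the parameter values are assumptions based on the publicly hosted checkpoints, not this repository's own API:

```python
import torch
from diffusers import KandinskyV22PriorPipeline, KandinskyV22Pipeline

# Stage 1: the diffusion image prior maps the text prompt to a CLIP-ViT-G image embedding.
prior = KandinskyV22PriorPipeline.from_pretrained(
    "kandinsky-community/kandinsky-2-2-prior", torch_dtype=torch.float16
).to("cuda")
image_embeds, negative_image_embeds = prior("red cat, 4k photo", guidance_scale=1.0).to_tuple()

# Stage 2: the U-Net decoder denoises latents conditioned on that image embedding.
decoder = KandinskyV22Pipeline.from_pretrained(
    "kandinsky-community/kandinsky-2-2-decoder", torch_dtype=torch.float16
).to("cuda")
image = decoder(
    image_embeds=image_embeds,
    negative_image_embeds=negative_image_embeds,
    height=768,
    width=768,
    num_inference_steps=50,
).images[0]
image.save("cat.png")
```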
Quick Start & Requirements
pip install "git+https://github.com/ai-forever/Kandinsky-2.git"
Example notebooks are provided in the ./notebooks folder.
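For depth-guided generation with ControlNet, a rough sketch using the diffusers KandinskyV22ControlnetPipeline; the checkpoint name is an assumption, and the depth map is stubbed with a random tensor where a real depth estimate would go:

```python
import torch
from diffusers import KandinskyV22PriorPipeline, KandinskyV22ControlnetPipeline

prior = KandinskyV22PriorPipeline.from_pretrained(
    "kandinsky-community/kandinsky-2-2-prior", torch_dtype=torch.float16
).to("cuda")
controlnet = KandinskyV22ControlnetPipeline.from_pretrained(
    "kandinsky-community/kandinsky-2-2-controlnet-depth", torch_dtype=torch.float16
).to("cuda")

image_embeds, negative_image_embeds = prior("a robot in a forest, 4k photo").to_tuple()

# `hint` stands in for a precomputed depth map: shape (1, 3, 768, 768), values scaled to [0, 1].
hint = torch.rand(1, 3, 768, 768, dtype=torch.float16, device="cuda")  # placeholder depth map

image = controlnet(
    image_embeds=image_embeds,
    negative_image_embeds=negative_image_embeds,
    hint=hint,
    height=768,
    width=768,
    num_inference_steps=50,
).images[0]
```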
Highlighted Details
Maintenance & Community
Last activity was about a year ago and the project is marked inactive.
Licensing & Compatibility
Limitations & Caveats
The 2.1 example code sets use_flash_attention=False, suggesting potential performance optimizations might be available or require specific configuration.
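For context, a sketch of where this flag is passed in the repository's 2.1 Python interface; the generation parameters below are illustrative:

```python
from kandinsky2 import get_kandinsky2

# Load Kandinsky 2.1 with flash attention disabled, as in the repository's example.
model = get_kandinsky2(
    "cuda",
    task_type="text2img",
    model_version="2.1",
    use_flash_attention=False,
)
images = model.generate_text2img(
    "red cat, 4k photo",
    num_steps=100,
    batch_size=1,
    guidance_scale=4,
    h=768,
    w=768,
    sampler="p_sampler",
    prior_cf_scale=4,
    prior_steps="5",
)
```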