kandinsky-5 by kandinskylab

Advanced diffusion models for versatile video and image generation

Created 3 months ago
494 stars

Top 62.6% on SourcePulse

Project Summary

Kandinsky 5.0 provides a family of advanced diffusion models for generating high-quality images and videos from text and image prompts. It targets engineers, researchers, and power users seeking robust AI media generation tools, offering flexible model sizes and capabilities for diverse applications.

How It Works

The system employs a latent diffusion pipeline with a Diffusion Transformer (DiT) as its core generative backbone. Generation is conditioned on text embeddings from Qwen2.5-VL and CLIP, while video encoding and decoding are handled by the HunyuanVideo 3D VAE. The family's distinguishing features are its distinct "Pro" (19B) and "Lite" (2B, 6B) model variants, support for multiple generation tasks (T2V, I2V, T2I, I2I), and techniques such as Flow Matching training and cross-attention conditioning for controllable, high-fidelity output.
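
As a rough illustration of the Flow Matching objective mentioned above (a minimal sketch, not the repository's training code; the `dit` call signature is assumed), the model learns to predict the velocity between a noise sample and the data along a straight interpolation path:

```python
import torch
import torch.nn.functional as F

# Hedged sketch of one flow-matching training step; `dit` is any network
# taking (latents, timesteps, text embeddings) -- names are illustrative.
def flow_matching_loss(dit, x1, text_emb):
    x0 = torch.randn_like(x1)                      # pure-noise endpoint
    t = torch.rand(x1.shape[0], device=x1.device)  # timesteps ~ U[0, 1]
    t_ = t.view(-1, *([1] * (x1.dim() - 1)))       # broadcast over latent dims
    xt = (1 - t_) * x0 + t_ * x1                   # point on the straight path
    v_target = x1 - x0                             # constant target velocity
    v_pred = dit(xt, t, text_emb)                  # DiT predicts the velocity
    return F.mse_loss(v_pred, v_target)
```

Sampling then integrates the learned velocity field from noise (t = 0) to data (t = 1) with an ODE solver.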

Quick Start & Requirements

  • Install: Clone the repository (git clone https://github.com/kandinskylab/kandinsky-5.git), navigate into the directory (cd kandinsky-5), and install dependencies (pip install -r requirements.txt).
  • Prerequisites: An NVIDIA GPU (at least 12GB VRAM for 5s generation), CUDA (12.8.1 recommended), and PyTorch (2.8 recommended). Flash Attention 3 is advised on NVIDIA Hopper GPUs.
  • Models: Download specific models using python download_models.py --models <model_name>; a minimal inference sketch follows this list.
  • Links: Project Page, Technical Report, 🤗 Diffusers, ComfyUI.
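
A minimal end-to-end usage sketch, assuming the repository exposes a text-to-video pipeline helper (the import path, the `get_T2V_pipeline` name, and its arguments are assumptions; consult the repo's own examples for the actual entry point):

```python
import torch

# Assumed import path and helper name -- verify against the repository's examples.
from kandinsky import get_T2V_pipeline

# Place each pipeline component on the GPU; the device-map keys are illustrative.
pipe = get_T2V_pipeline(
    device_map={"dit": "cuda:0", "vae": "cuda:0", "text_embedder": "cuda:0"},
)

# Generate a short clip; `time_length` (seconds) is an assumed parameter name.
video = pipe(
    "A red fox running through fresh snow, cinematic lighting",
    time_length=5,
)
```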

Highlighted Details

  • Model Family: Offers distinct "Pro" (19B) and "Lite" (2B, 6B) model lines for video and image generation, catering to different quality/resource needs.
  • Performance Optimizations: Supports Flash Attention (2, 3), SDPA, Sage Attention, Magcache, and NF4 quantization of the Qwen encoder, enabling generation on GPUs with as little as 12GB VRAM (a standard NF4 setup is sketched after this list).
  • Advanced Capabilities: Includes Text-to-Video (T2V), Image-to-Video (I2V), Text-to-Image (T2I), and Image-to-Image (I2I) generation, with options for 5s and 10s video durations and high resolutions (1K+).
  • Speed Trade-offs: Distilled and no-CFG variants offer up to 6x faster generation with minimal quality loss, while SFT models target maximum quality.
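
The Qwen encoder NF4 quantization noted above typically follows the standard bitsandbytes pattern shown below (a sketch of the general technique; how the repository wires it in, and the exact encoder checkpoint, are assumptions):

```python
import torch
from transformers import AutoModel, BitsAndBytesConfig

# Standard Hugging Face NF4 setup (the general technique, not necessarily
# how Kandinsky 5.0 configures it internally).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # NormalFloat4 weights
    bnb_4bit_compute_dtype=torch.bfloat16, # compute in bf16 for stability
)

text_encoder = AutoModel.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",  # assumed checkpoint; verify against the repo
    quantization_config=bnb_config,
)
```

Quantizing only the text encoder leaves the DiT weights untouched while cutting encoder memory roughly 4x, which helps fit the pipeline on 12GB-class GPUs.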

Maintenance & Community

The README lists an extensive core team and contributor roster, indicating active development. Beta testing for Kandinsky Video Lite is available via a Telegram bot.

Licensing & Compatibility

The repository's license is not explicitly stated in the README, which may be a barrier for commercial adoption or integration.

Limitations & Caveats

A known bug in the source build can produce noisy output for 10-second generation with the NABLA algorithm; a workaround is provided. Published latency benchmarks assume high-end hardware (NVIDIA H100, CUDA 12.8.1, PyTorch 2.8).

Health Check

  • Last Commit: 2 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 26
  • Issues (30d): 18
  • Star History: 332 stars in the last 30 days
