musubi-tuner by kohya-ss

LoRA training/inference scripts for video diffusion models

Created 8 months ago
1,134 stars

Top 33.9% on SourcePulse

Project Summary

This repository provides scripts for training and inference of LoRA models for video generation using HunyuanVideo, Wan2.1, and FramePack architectures. It is an unofficial project aimed at researchers and power users interested in fine-tuning video diffusion models. The primary benefit is enabling custom LoRA training for these advanced video generation models.

How It Works

Musubi Tuner leverages Low-Rank Adaptation (LoRA) for efficient fine-tuning of large video diffusion models. It supports multiple architectures including HunyuanVideo, Wan2.1, and FramePack, offering flexibility in model choice. The project emphasizes pre-caching of latents and text encoder outputs to optimize training performance and reduce VRAM usage. It also incorporates features like PyTorch Dynamo optimization and various attention mechanisms (SDPA, FlashAttention, xformers, SageAttention) for speed and memory efficiency.
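
To ground the LoRA idea, here is a minimal PyTorch sketch of a low-rank adapter wrapped around a frozen linear layer. This is illustrative only; musubi-tuner's actual module names, injection points, and training loop differ.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base layer plus a trainable low-rank update:
    y = base(x) + (alpha / r) * up(down(x))."""

    def __init__(self, base: nn.Linear, r: int = 16, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # only the LoRA factors are trained
        self.down = nn.Linear(base.in_features, r, bias=False)
        self.up = nn.Linear(r, base.out_features, bias=False)
        nn.init.kaiming_uniform_(self.down.weight)
        nn.init.zeros_(self.up.weight)  # update starts at zero, so output is initially unchanged
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.up(self.down(x))
```

Because only the two small factor matrices receive gradients, optimizer state and gradient memory scale with the rank r rather than with the full weight matrix, which is what makes fine-tuning large video diffusion models tractable on consumer GPUs.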

Quick Start & Requirements

  • Installation: pip install -r requirements.txt (an experimental uv-based install is also available).
  • Prerequisites: Python 3.10+, PyTorch 2.5.1+ with matching CUDA version (e.g., cu124). Recommended VRAM: 12GB for image training, 24GB+ for video training. Recommended RAM: 64GB+.
  • Setup: Requires downloading base models (HunyuanVideo, Wan2.1, FramePack) and configuring dataset paths via TOML files. Latent and text encoder output pre-caching are mandatory steps before training (see the sketch after this list).
  • Docs: Dataset Configuration Guide, Training Guide, Inference Guide.
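
As a rough illustration of the pre-caching step, the sketch below encodes each batch once and writes latents to disk so the VAE (and, analogously, the text encoders) never occupy VRAM during training. The `vae` object and the diffusers-style `encode(...).latent_dist` API are assumptions for illustration, not musubi-tuner's actual interface.

```python
from pathlib import Path

import torch

@torch.no_grad()
def cache_latents(vae, dataloader, out_dir: str) -> None:
    """One-time pass: encode pixels to latents and save them to disk.

    Training then reads the cached latents directly, so the VAE can be
    unloaded entirely, cutting per-step compute and VRAM.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    vae.eval().cuda()
    for i, batch in enumerate(dataloader):
        # Hypothetical diffusers-style VAE call; adapt to the real model.
        latents = vae.encode(batch["pixels"].cuda()).latent_dist.sample()
        torch.save(latents.cpu(), out / f"latents_{i:06d}.pt")
```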

Highlighted Details

  • Supports training and inference for HunyuanVideo, Wan2.1, and FramePack.
  • Offers experimental support for PyTorch Dynamo optimization for faster training.
  • Includes options for memory-saving techniques like FP8, blocks_to_swap, and various attention mechanisms.
  • Provides scripts for LoRA merging and conversion to ComfyUI-compatible formats (merging sketched below).
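
Merging a trained LoRA back into base weights (e.g., before converting for ComfyUI) is conceptually a single rank-r update per layer. Below is a minimal sketch under the standard LoRA formulation; the project's actual merge scripts additionally handle per-layer key mapping, dtypes, and naming conventions.

```python
import torch

def merge_lora_weight(
    w: torch.Tensor,     # base weight, shape (out_features, in_features)
    down: torch.Tensor,  # LoRA down projection, shape (r, in_features)
    up: torch.Tensor,    # LoRA up projection, shape (out_features, r)
    alpha: float,
    r: int,
) -> torch.Tensor:
    """Fold the low-rank update into the base weight:
    W' = W + (alpha / r) * up @ down."""
    return w + (alpha / r) * (up @ down)
```

After merging, the layer behaves identically to the adapted model but carries no extra parameters at inference time.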

Maintenance & Community

  • Active development with recent updates in April 2025 adding FramePack support and batch inference modes.
  • GitHub Discussions are enabled for community Q&A.
  • Project is supported via GitHub Sponsors.

Licensing & Compatibility

  • Primarily licensed under Apache License 2.0.
  • Code modified from HunyuanVideo remains under that project's license; code modified from Wan2.1 and FramePack remains under Apache 2.0.
  • Compatible with commercial use under Apache 2.0 terms.

Limitations & Caveats

This project is unofficial and experimental, and it is under active development: features and APIs may change without notice. Video training support in particular is still evolving, and some functionality may not work as expected. Production use is not recommended.

Health Check

  • Last Commit: 1 day ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 34
  • Issues (30d): 54
  • Star History: 168 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Jeff Hammerbacher (Cofounder of Cloudera), and 5 more.

ai-toolkit by ostris (6k stars, top 0.9%)
Training toolkit for finetuning diffusion models
Created 2 years ago, updated 14 hours ago
Starred by Alex Yu (Research Scientist at OpenAI; Former Cofounder of Luma AI), Lianmin Zheng (Coauthor of SGLang, vLLM), and 2 more.

HunyuanVideo by Tencent-Hunyuan (11k stars, top 0.2%)
PyTorch code for video generation research
Created 9 months ago, updated 3 weeks ago
Starred by Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), Zack Li (Cofounder of Nexa AI), and 19 more.

LLaVA by haotian-liu (24k stars, top 0.2%)
Multimodal assistant with GPT-4 level capabilities
Created 2 years ago, updated 1 year ago