Text-to-video generation via diffusion model fine-tuning
Tune-A-Video enables one-shot fine-tuning of pre-trained text-to-image diffusion models for text-to-video generation. Given a single input video, it adapts models like Stable Diffusion or personalized DreamBooth checkpoints to generate new videos from edited text prompts. This is useful for researchers and content creators who want to produce novel video content with specific styles or subjects.
How It Works
The method fine-tunes a pre-trained text-to-image diffusion model on a single video-text pair. It leverages the image-generation capabilities of models like Stable Diffusion and extends them to the temporal domain by inflating the 2D UNet with spatio-temporal attention. Only a small subset of attention parameters is updated during fine-tuning, which keeps training lightweight while letting the model generate temporally coherent videos that follow the motion of the input clip and match an edited text prompt.
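The sketch below illustrates the underlying one-shot objective: the standard noise-prediction loss on the video's latent frames, with only the attention query projections left trainable. This is a hedged approximation, not the repository's code; it applies the stock 2D UNet from diffusers frame-by-frame instead of the inflated spatio-temporal UNet, and the model ID, frame tensor, and prompt are placeholders.

```python
# Hedged sketch of the one-shot fine-tuning objective (not the repo's exact code).
import torch
import torch.nn.functional as F
from diffusers import AutoencoderKL, DDPMScheduler, UNet2DConditionModel
from transformers import CLIPTextModel, CLIPTokenizer

model_id = "CompVis/stable-diffusion-v1-4"
tokenizer = CLIPTokenizer.from_pretrained(model_id, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(model_id, subfolder="text_encoder")
vae = AutoencoderKL.from_pretrained(model_id, subfolder="vae")
unet = UNet2DConditionModel.from_pretrained(model_id, subfolder="unet")
noise_scheduler = DDPMScheduler.from_pretrained(model_id, subfolder="scheduler")

# Freeze everything, then unfreeze only the attention query projections,
# mirroring the paper's parameter-efficient fine-tuning strategy.
for p in unet.parameters():
    p.requires_grad_(False)
trainable = [p for n, p in unet.named_parameters()
             if "attn1.to_q" in n or "attn2.to_q" in n]
for p in trainable:
    p.requires_grad_(True)
optimizer = torch.optim.AdamW(trainable, lr=3e-5)

# One video-text pair: `frames` is (num_frames, 3, H, W) in [-1, 1] and
# `prompt` describes the video (both are placeholders here).
frames = torch.randn(8, 3, 512, 512)
prompt = "a man is skiing"

for step in range(500):
    with torch.no_grad():
        latents = vae.encode(frames).latent_dist.sample() * vae.config.scaling_factor
        ids = tokenizer([prompt], padding="max_length",
                        max_length=tokenizer.model_max_length,
                        return_tensors="pt").input_ids
        text_emb = text_encoder(ids).last_hidden_state.expand(latents.shape[0], -1, -1)

    noise = torch.randn_like(latents)
    timesteps = torch.randint(0, noise_scheduler.config.num_train_timesteps,
                              (latents.shape[0],))
    noisy_latents = noise_scheduler.add_noise(latents, noise, timesteps)

    # Standard diffusion objective: predict the noise that was added.
    noise_pred = unet(noisy_latents, timesteps, encoder_hidden_states=text_emb).sample
    loss = F.mse_loss(noise_pred, noise)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```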
Quick Start & Requirements
Install dependencies with pip install -r requirements.txt. xformers is highly recommended for efficiency.
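After fine-tuning, inference with a trained checkpoint looks roughly like the sketch below. Class and helper names (TuneAVideoPipeline, UNet3DConditionModel, save_videos_grid) and the argument values follow the repository's published example and may differ across versions; the checkpoint paths and prompt are placeholders.

```python
import torch
from tuneavideo.pipelines.pipeline_tuneavideo import TuneAVideoPipeline
from tuneavideo.models.unet import UNet3DConditionModel
from tuneavideo.util import save_videos_grid

# Placeholder paths: the base Stable Diffusion weights and the fine-tuned output dir.
pretrained_model_path = "./checkpoints/stable-diffusion-v1-4"
my_model_path = "./outputs/man-skiing"

unet = UNet3DConditionModel.from_pretrained(
    my_model_path, subfolder="unet", torch_dtype=torch.float16).to("cuda")
pipe = TuneAVideoPipeline.from_pretrained(
    pretrained_model_path, unet=unet, torch_dtype=torch.float16).to("cuda")
pipe.enable_xformers_memory_efficient_attention()

# Edited prompt: same motion as the training video, new subject.
prompt = "spider man is skiing"
video = pipe(prompt, video_length=24, height=512, width=512,
             num_inference_steps=50, guidance_scale=12.5).videos
save_videos_grid(video, f"./{prompt}.gif")
```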
Highlighted Details
Maintenance & Community
The implementation is built on the Hugging Face diffusers library.
Licensing & Compatibility
Limitations & Caveats