Self-Forcing by guandeh17

Video diffusion model training/inference

created 1 month ago
2,336 stars

Top 20.1% on sourcepulse

View on GitHub
Project Summary

This repository implements "Self-Forcing," a technique to bridge the train-test distribution gap in autoregressive video diffusion models. It enables real-time, streaming video generation with quality comparable to state-of-the-art models, targeting researchers and developers working on advanced video synthesis.

How It Works

Self-Forcing simulates the inference process during training: rather than conditioning on ground-truth frames, the model performs autoregressive rollouts with KV caching and generates each frame conditioned on its own previously generated outputs. This directly addresses the mismatch between how the model is trained and how it is used at generation time, leading to more stable and efficient inference.
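
The pattern can be illustrated with a minimal PyTorch sketch. Everything in it (the TinyFrameDenoiser module, the context tensor standing in for a KV cache, the placeholder loss) is a hypothetical stand-in chosen for brevity, not the repository's actual API; it only demonstrates the train-time rollout that feeds the model its own previous outputs.

    import torch
    import torch.nn as nn

    class TinyFrameDenoiser(nn.Module):
        """Stand-in for a causal video diffusion backbone (hypothetical)."""
        def __init__(self, dim=64):
            super().__init__()
            self.net = nn.Linear(2 * dim, dim)

        def forward(self, noisy_latent, context):
            # "context" summarizes previously generated frames, playing the
            # role that a KV cache plays in the real model.
            return self.net(torch.cat([noisy_latent, context], dim=-1))

    def self_forcing_rollout(model, num_frames=4, dim=64):
        """Autoregressive rollout: each frame is conditioned on the model's
        own earlier outputs, exactly as at inference time."""
        frames, context = [], torch.zeros(1, dim)
        for _ in range(num_frames):
            noisy = torch.randn(1, dim)            # start each frame from noise
            frame = model(noisy, context)          # one-step denoise (few-step in practice)
            frames.append(frame)
            context = context + frame.detach() / num_frames  # crude stand-in for a cache update
        return torch.stack(frames, dim=1)          # (batch, frames, dim)

    model = TinyFrameDenoiser()
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
    rollout = self_forcing_rollout(model)          # training sees the model's own rollout
    loss = rollout.pow(2).mean()                   # placeholder for a distribution-matching loss
    loss.backward()
    opt.step()

Because the loss is computed on the rollout itself, gradients reflect the same conditioning the model will see at inference time, which is the core of the self-forcing idea.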

Quick Start & Requirements

  • Installation: Create a conda environment (conda create -n self_forcing python=3.10 -y), activate it (conda activate self_forcing), and install dependencies (pip install -r requirements.txt, pip install flash-attn --no-build-isolation, python setup.py develop).
  • Requirements: Nvidia GPU with >= 24GB memory (RTX 4090, A100, H100 tested), Linux OS, 64GB RAM.
  • Demo: python demo.py
  • Inference: python inference.py --config_path configs/self_forcing_dmd.yaml --output_folder videos/self_forcing_dmd --checkpoint_path checkpoints/self_forcing_dmd.pt --data_path prompts/MovieGenVideoBench_extended.txt --use_ema
  • Models: Download checkpoints from HuggingFace: huggingface-cli download Wan-AI/Wan2.1-T2V-1.3B --local-dir-use-symlinks False --local-dir wan_models/Wan2.1-T2V-1.3B and huggingface-cli download gdhe17/Self-Forcing checkpoints/self_forcing_dmd.pt --local-dir . (a Python alternative using huggingface_hub is sketched after this list).
  • Docs: Paper, Website, Models (HuggingFace) links provided in README.
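
For scripted setups, the two download commands in the Models bullet can also be run from Python via the huggingface_hub library. The sketch below simply mirrors the CLI invocations above and assumes the same target paths.

    from huggingface_hub import snapshot_download, hf_hub_download

    # Base Wan2.1-T2V-1.3B weights (mirrors the first huggingface-cli command)
    snapshot_download(
        repo_id="Wan-AI/Wan2.1-T2V-1.3B",
        local_dir="wan_models/Wan2.1-T2V-1.3B",
    )

    # Self-Forcing DMD checkpoint (mirrors the second huggingface-cli command)
    hf_hub_download(
        repo_id="gdhe17/Self-Forcing",
        filename="checkpoints/self_forcing_dmd.pt",
        local_dir=".",
    )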

Highlighted Details

  • Enables real-time, streaming video generation on a single RTX 4090.
  • Matches quality of state-of-the-art diffusion models.
  • Training is data-free (except GAN version), requiring only ODE initialization checkpoints.
  • Supports prompt extension via LLMs for better results.

Maintenance & Community

  • Built on top of CausVid and Wan2.1 implementations.
  • Citation details provided for the associated paper.

Licensing & Compatibility

  • License not explicitly stated in the README. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

  • The model performs better with long, detailed prompts; prompt extension integration is planned.
  • Speed can be further improved with torch.compile, TAEHV-VAE, or FP8 Linear layers, with potential quality trade-offs (see the torch.compile sketch after this list).
  • Training reproduction requires significant GPU resources (64 H100 GPUs for <2 hours, or 8 H100 GPUs for <16 hours with gradient accumulation).
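
The torch.compile speed-up mentioned above follows the standard PyTorch 2.x pattern. The sketch below uses a dummy module as a placeholder for the repository's actual generator, which is not shown here; only the torch.compile call itself is the point.

    import torch
    import torch.nn as nn

    class DummyDenoiser(nn.Module):
        """Placeholder module; in practice you would compile the loaded generator."""
        def __init__(self, dim=64):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(dim, dim), nn.SiLU(), nn.Linear(dim, dim))

        def forward(self, x):
            return self.net(x)

    model = DummyDenoiser().eval()
    compiled = torch.compile(model)    # first call compiles; later calls run the optimized graph
    with torch.no_grad():
        out = compiled(torch.randn(8, 64))
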
Health Check

  • Last commit: 3 weeks ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 7

Star History
2,571 stars in the last 90 days

Explore Similar Projects

Starred by Chenlin Meng (Cofounder of Pika), Patrick von Platen (Core Contributor to Hugging Face Transformers and Diffusers), and 1 more.

Tune-A-Video by showlab (4k stars)
Text-to-video generation via diffusion model fine-tuning
created 2 years ago, updated 1 year ago