Self-Forcing by guandeh17

Video diffusion model training/inference

created 1 month ago
2,336 stars

Top 20.1% on sourcepulse

View on GitHub
Project Summary

This repository implements "Self-Forcing," a technique to bridge the train-test distribution gap in autoregressive video diffusion models. It enables real-time, streaming video generation with quality comparable to state-of-the-art models, targeting researchers and developers working on advanced video synthesis.

How It Works

Self-Forcing simulates the inference process during training: rather than conditioning on ground-truth frames, the model performs autoregressive rollouts with KV caching and generates each frame conditioned on its own previously generated outputs. This directly addresses the mismatch between how the model is trained and how it is used at generation time, leading to more stable and efficient inference.
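
The pattern can be illustrated with a minimal PyTorch sketch. Everything in it (the TinyFrameDenoiser module, the context tensor standing in for a KV cache, the placeholder loss) is a hypothetical stand-in chosen for brevity, not the repository's actual API; it only demonstrates the train-time rollout that feeds the model its own previous outputs.

    import torch
    import torch.nn as nn

    class TinyFrameDenoiser(nn.Module):
        """Stand-in for a causal video diffusion backbone (hypothetical)."""
        def __init__(self, dim=64):
            super().__init__()
            self.net = nn.Linear(2 * dim, dim)

        def forward(self, noisy_latent, context):
            # "context" summarizes previously generated frames, playing the
            # role that a KV cache plays in the real model.
            return self.net(torch.cat([noisy_latent, context], dim=-1))

    def self_forcing_rollout(model, num_frames=4, dim=64):
        """Autoregressive rollout: each frame is conditioned on the model's
        own earlier outputs, exactly as at inference time."""
        frames, context = [], torch.zeros(1, dim)
        for _ in range(num_frames):
            noisy = torch.randn(1, dim)            # start each frame from noise
            frame = model(noisy, context)          # one-step denoise (few-step in practice)
            frames.append(frame)
            context = context + frame.detach() / num_frames  # crude stand-in for a cache update
        return torch.stack(frames, dim=1)          # (batch, frames, dim)

    model = TinyFrameDenoiser()
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
    rollout = self_forcing_rollout(model)          # training sees the model's own rollout
    loss = rollout.pow(2).mean()                   # placeholder for a distribution-matching loss
    loss.backward()
    opt.step()

Because the loss is computed on the rollout itself, gradients reflect the same conditioning the model will see at inference time, which is the core of the self-forcing idea.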

Quick Start & Requirements

  • Installation: Create a conda environment (conda create -n self_forcing python=3.10 -y), activate it (conda activate self_forcing), and install dependencies (pip install -r requirements.txt, pip install flash-attn --no-build-isolation, python setup.py develop).
  • Requirements: Nvidia GPU with >= 24GB memory (RTX 4090, A100, H100 tested), Linux OS, 64GB RAM.
  • Demo: python demo.py
  • Inference: python inference.py --config_path configs/self_forcing_dmd.yaml --output_folder videos/self_forcing_dmd --checkpoint_path checkpoints/self_forcing_dmd.pt --data_path prompts/MovieGenVideoBench_extended.txt --use_ema
  • Models: Download checkpoints from HuggingFace: huggingface-cli download Wan-AI/Wan2.1-T2V-1.3B --local-dir-use-symlinks False --local-dir wan_models/Wan2.1-T2V-1.3B and huggingface-cli download gdhe17/Self-Forcing checkpoints/self_forcing_dmd.pt --local-dir . (a Python alternative using huggingface_hub is sketched after this list).
  • Docs: Paper, Website, Models (HuggingFace) links provided in README.
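
For scripted setups, the two download commands in the Models bullet can also be run from Python via the huggingface_hub library. The sketch below simply mirrors the CLI invocations above and assumes the same target paths.

    from huggingface_hub import snapshot_download, hf_hub_download

    # Base Wan2.1-T2V-1.3B weights (mirrors the first huggingface-cli command)
    snapshot_download(
        repo_id="Wan-AI/Wan2.1-T2V-1.3B",
        local_dir="wan_models/Wan2.1-T2V-1.3B",
    )

    # Self-Forcing DMD checkpoint (mirrors the second huggingface-cli command)
    hf_hub_download(
        repo_id="gdhe17/Self-Forcing",
        filename="checkpoints/self_forcing_dmd.pt",
        local_dir=".",
    )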

Highlighted Details

  • Enables real-time, streaming video generation on a single RTX 4090.
  • Matches quality of state-of-the-art diffusion models.
  • Training is data-free (except GAN version), requiring only ODE initialization checkpoints.
  • Supports prompt extension via LLMs for better results.

Maintenance & Community

  • Built on top of CausVid and Wan2.1 implementations.
  • Citation details provided for the associated paper.

Licensing & Compatibility

  • License not explicitly stated in the README. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

  • The model performs better with long, detailed prompts; prompt extension integration is planned.
  • Speed can be further improved with torch.compile, TAEHV-VAE, or FP8 Linear layers, with potential quality trade-offs (see the torch.compile sketch after this list).
  • Training reproduction requires significant GPU resources (64 H100 GPUs for <2 hours, or 8 H100 GPUs for <16 hours with gradient accumulation).
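
The torch.compile speed-up mentioned above follows the standard PyTorch 2.x pattern. The sketch below uses a dummy module as a placeholder for the repository's actual generator, which is not shown here; only the torch.compile call itself is the point.

    import torch
    import torch.nn as nn

    class DummyDenoiser(nn.Module):
        """Placeholder module; in practice you would compile the loaded generator."""
        def __init__(self, dim=64):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(dim, dim), nn.SiLU(), nn.Linear(dim, dim))

        def forward(self, x):
            return self.net(x)

    model = DummyDenoiser().eval()
    compiled = torch.compile(model)    # first call compiles; later calls run the optimized graph
    with torch.no_grad():
        out = compiled(torch.randn(8, 64))
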
Health Check

  • Last commit: 3 weeks ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 7

Star History
2,571 stars in the last 90 days

Explore Similar Projects

Starred by Chenlin Meng (Cofounder of Pika), Patrick von Platen (Core Contributor to Hugging Face Transformers and Diffusers), and 1 more.

Tune-A-Video by showlab (4k stars)
Text-to-video generation via diffusion model fine-tuning
created 2 years ago, updated 1 year ago