Self-Forcing by guandeh17

Video diffusion model training/inference

Created 4 months ago
2,680 stars

Top 17.5% on SourcePulse

View on GitHub
1 Expert Loves This Project
Project Summary

This repository implements "Self-Forcing," a technique to bridge the train-test distribution gap in autoregressive video diffusion models. It enables real-time, streaming video generation with quality comparable to state-of-the-art models, targeting researchers and developers working on advanced video synthesis.

How It Works

Self-Forcing simulates the inference process during training: the model performs an autoregressive rollout with KV caching, so each chunk of frames is generated while conditioning on the model's own previously generated output rather than on ground-truth context. This directly addresses the mismatch between how the model is trained and how it is used for generation, reducing the error accumulation that otherwise appears at inference time while keeping generation efficient and streamable.
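
The sketch below illustrates that idea in isolation, under assumed placeholder components: a toy model generates video chunks one after another, conditions each chunk on its own previously generated chunks held in a simple cache, and the loss is computed on the rollout itself. ToyCausalDenoiser, the one-step update rule, and the dummy loss are hypothetical stand-ins, not the repository's actual training code.

```python
# Conceptual sketch of the Self-Forcing training idea (NOT the repository's
# code): each chunk is generated during training by conditioning on the
# model's OWN previous outputs, mirroring inference-time behavior.
import torch
import torch.nn as nn


class ToyCausalDenoiser(nn.Module):
    """Stand-in for a causal video diffusion backbone with a KV cache."""

    def __init__(self, dim: int = 64):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, noisy_chunk, kv_cache):
        # A real model would attend over cached keys/values from previously
        # generated chunks; here the "cache" is just the past chunks and the
        # context is their mean, purely as a placeholder.
        context = torch.stack(kv_cache + [noisy_chunk], dim=0).mean(dim=0)
        return self.proj(context)


def self_forcing_rollout(model, num_chunks=4, chunk_shape=(2, 64), denoise_steps=4):
    """Training-time autoregressive rollout: each chunk is generated while
    conditioning on the model's own previous outputs, as at inference."""
    kv_cache = []                         # the model's own past outputs
    generated = []
    for _ in range(num_chunks):
        x = torch.randn(chunk_shape)      # each chunk starts from noise
        for _ in range(denoise_steps):    # few-step denoising (placeholder rule)
            x = x - 0.1 * model(x, kv_cache)
        kv_cache.append(x.detach())       # cache the chunk (gradient handling
                                          # through the cache is simplified here)
        generated.append(x)
    return torch.cat(generated, dim=0)


model = ToyCausalDenoiser()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

video = self_forcing_rollout(model)
# The real method scores the rollout with a distribution-matching objective
# (e.g. the DMD-style loss behind self_forcing_dmd); a dummy loss stands in.
loss = video.pow(2).mean()
optimizer.zero_grad()
loss.backward()
optimizer.step()
```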

Quick Start & Requirements

  • Installation: Create a conda environment (conda create -n self_forcing python=3.10 -y), activate it (conda activate self_forcing), and install dependencies (pip install -r requirements.txt, pip install flash-attn --no-build-isolation, python setup.py develop).
  • Requirements: Nvidia GPU with >= 24GB memory (RTX 4090, A100, H100 tested), Linux OS, 64GB RAM.
  • Demo: python demo.py
  • Inference: python inference.py --config_path configs/self_forcing_dmd.yaml --output_folder videos/self_forcing_dmd --checkpoint_path checkpoints/self_forcing_dmd.pt --data_path prompts/MovieGenVideoBench_extended.txt --use_ema
  • Models: Download checkpoints from HuggingFace: huggingface-cli download Wan-AI/Wan2.1-T2V-1.3B --local-dir-use-symlinks False --local-dir wan_models/Wan2.1-T2V-1.3B and huggingface-cli download gdhe17/Self-Forcing checkpoints/self_forcing_dmd.pt --local-dir . (a Python-API alternative is sketched after this list).
  • Docs: Paper, Website, Models (HuggingFace) links provided in README.
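
As an optional convenience (not a step from the README), the same checkpoints can be fetched from Python via the huggingface_hub library, which provides the huggingface-cli tool used above:

```python
# Optional: fetch the same checkpoints from Python instead of the CLI.
# Equivalent to the huggingface-cli commands listed above.
from huggingface_hub import snapshot_download, hf_hub_download

# Base Wan2.1-T2V-1.3B model -> wan_models/Wan2.1-T2V-1.3B
snapshot_download(
    repo_id="Wan-AI/Wan2.1-T2V-1.3B",
    local_dir="wan_models/Wan2.1-T2V-1.3B",
)

# Self-Forcing DMD checkpoint -> ./checkpoints/self_forcing_dmd.pt
hf_hub_download(
    repo_id="gdhe17/Self-Forcing",
    filename="checkpoints/self_forcing_dmd.pt",
    local_dir=".",
)
```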

Highlighted Details

  • Enables real-time, streaming video generation on a single RTX 4090.
  • Matches quality of state-of-the-art diffusion models.
  • Training is data-free (except GAN version), requiring only ODE initialization checkpoints.
  • Supports prompt extension via LLMs for better results.

Maintenance & Community

  • Built on top of CausVid and Wan2.1 implementations.
  • Citation details provided for the associated paper.

Licensing & Compatibility

  • License not explicitly stated in the README. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

  • Model performs better with long, detailed prompts; prompt extension integration is planned.
  • Speed can be further improved with torch.compile, TAEHV-VAE, or FP8 Linear layers, with potential quality trade-offs (see the torch.compile sketch after this list).
  • Training reproduction requires significant GPU resources (64 H100 GPUs for <2 hours, or 8 H100 GPUs for <16 hours with gradient accumulation).
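
As a small illustration of the first speed-up option, torch.compile can wrap any PyTorch module; the toy model below is a placeholder rather than the repository's generator, and actual gains depend on the architecture and GPU.

```python
# Illustrative only: applying torch.compile to a toy module.
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Sequential(nn.Linear(512, 512), nn.GELU(), nn.Linear(512, 512)).to(device)

# torch.compile specializes and fuses the forward pass. The first call pays
# a one-time compilation cost; later calls are typically faster.
compiled = torch.compile(model)

x = torch.randn(64, 512, device=device)
with torch.no_grad():
    _ = compiled(x)      # warm-up (triggers compilation)
    out = compiled(x)    # subsequent calls reuse the compiled graph
```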

Health Check

  • Last Commit: 1 month ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 1
  • Issues (30d): 11

Star History

159 stars in the last 30 days

Explore Similar Projects

Starred by Patrick von Platen (Author of Hugging Face Diffusers; Research Engineer at Mistral), Hanlin Tang (CTO Neural Networks at Databricks; Cofounder of MosaicML), and 1 more.

diffusion by mosaicml

0.1%
709 stars
Diffusion model training code
Created 2 years ago
Updated 9 months ago
Starred by Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), Yaowei Zheng (Author of LLaMA-Factory), and 1 more.

FastVideo by hao-ai-lab

1.5%
2k stars
Framework for accelerated video generation
Created 11 months ago
Updated 14 hours ago