Zero-shot video editor (ICCV 2023 Oral) using attention fusion
FateZero is a zero-shot framework for text-driven video editing: users modify a video with a text prompt, with no per-prompt training and no manual masking. It builds on pre-trained diffusion models and keeps structure and motion consistent across the edited frames, which makes it useful to researchers and practitioners working on video manipulation.
How It Works
FateZero captures intermediate attention maps during the diffusion model's inversion process and fuses them into the editing pass to preserve structural and motion information. To reduce semantic leakage, it blends self-attention maps using a mask derived from cross-attention features of the source prompt. The self-attention in the denoising UNet is also replaced with a spatial-temporal attention mechanism to keep frames consistent with one another.
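The blending step can be illustrated with a small sketch. This is not the repository's code: the function name, the list-of-rows layout, and the threshold value of 0.3 are all assumptions made for illustration. It assumes one self-attention row per query pixel and one cross-attention weight per pixel for the edited word. Pixels that attend strongly to the edited word take their self-attention from the editing pass; all other pixels reuse the self-attention stored during inversion.

```python
def fuse_self_attention(src_attn, edit_attn, cross_attn_word, threshold=0.3):
    """Blend two self-attention maps row by row (illustrative sketch only).

    src_attn        -- rows stored during inversion of the source video
    edit_attn       -- rows computed in the editing pass
    cross_attn_word -- one cross-attention weight per query pixel
                       for the edited word
    threshold       -- hypothetical cutoff that turns weights into a mask
    """
    fused = []
    for src_row, edit_row, weight in zip(src_attn, edit_attn, cross_attn_word):
        if weight > threshold:
            # Inside the mask: the pixel belongs to the edited region.
            fused.append(list(edit_row))
        else:
            # Outside the mask: keep the source structure and motion.
            fused.append(list(src_row))
    return fused
```

For a two-pixel toy input, calling `fuse_self_attention(src, edit, [0.9, 0.1])` would take the first row from `edit` and the second from `src`. A real implementation would operate on batched tensors per frame and per attention head.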
Quick Start & Requirements
Install dependencies using conda and pip install -r requirements.txt.
Installing xformers is recommended for A100/3090 GPUs.
Highlighted Details
Maintenance & Community
The project is actively maintained as a codebase for research work. Feedback and discussions are welcomed via GitHub issues. Contact information for key contributors is provided.
Licensing & Compatibility
The repository's license is not explicitly stated in the README. Compatibility for commercial use or closed-source linking is not specified.
Limitations & Caveats
The README notes that xformers installation can be unstable. While low-cost settings for 16GB GPUs are provided, performance benchmarks for broader hardware configurations are still being developed.
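Because the xformers install may fail, a script can check whether the package is importable before relying on it. The following is a minimal sketch using only the Python standard library. The helper name `module_available` is hypothetical and is not part of FateZero.

```python
import importlib.util


def module_available(name):
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    if module_available("xformers"):
        print("xformers found")
    else:
        print("xformers not found; continuing without it")
```

The check only confirms that the package is installed. It does not verify that the installed build matches the local CUDA or PyTorch version.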