Video-to-video translation framework for zero-shot text-guided video rendering
This project provides a zero-shot, text-guided video-to-video translation framework for researchers and artists. It addresses the challenge of maintaining temporal consistency in video generation by leveraging adapted diffusion models, enabling users to restyle videos based on text prompts without retraining.
How It Works
The framework consists of two main stages: key frame translation and full video translation. Key frames are generated using a diffusion model enhanced with hierarchical cross-frame constraints to ensure coherence in shape, texture, and color. Subsequent frames are then propagated from these key frames using temporal-aware patch matching and frame blending techniques. This approach allows for global style and local texture consistency with minimal computational cost.
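The control flow of the two stages can be illustrated with a short, self-contained Python sketch. The function names (translate_key_frame, propagate, rerender), the key-frame interval, and the cross-fade used for propagation are illustrative placeholders rather than the project's actual API; in the real pipeline the first stage runs the diffusion model under hierarchical cross-frame constraints and the second stage uses temporal-aware patch matching and blending.

```python
# Conceptual sketch of the two-stage pipeline; all names here are placeholders.
from typing import List, Optional
import numpy as np

def translate_key_frame(frame: np.ndarray, prompt: str,
                        prev_key: Optional[np.ndarray]) -> np.ndarray:
    # Placeholder: the real pipeline runs a diffusion model constrained by the
    # previously translated key frame to keep shape, texture, and color coherent.
    return frame

def propagate(key_a: np.ndarray, key_b: np.ndarray, n_between: int) -> List[np.ndarray]:
    # Placeholder: the real pipeline synthesizes in-between frames with
    # temporal-aware patch matching against both key frames and blends them;
    # here a simple cross-fade stands in to show the data flow.
    return [(1 - t) * key_a + t * key_b
            for t in np.linspace(0, 1, n_between + 2)[1:-1]]

def rerender(frames: List[np.ndarray], prompt: str, key_interval: int = 10) -> List[np.ndarray]:
    # Stage 1: translate every key_interval-th frame with the diffusion model.
    key_ids = list(range(0, len(frames), key_interval))
    keys: List[np.ndarray] = []
    prev: Optional[np.ndarray] = None
    for i in key_ids:
        prev = translate_key_frame(frames[i], prompt, prev)
        keys.append(prev)
    # Stage 2: fill the gaps between consecutive key frames by propagation.
    # (Frames after the last key frame are omitted in this sketch.)
    out: List[np.ndarray] = []
    for (ia, ka), (ib, kb) in zip(zip(key_ids, keys), zip(key_ids[1:], keys[1:])):
        out.append(ka)
        out.extend(propagate(ka, kb, ib - ia - 1))
    out.append(keys[-1])
    return out
```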
Quick Start & Requirements
Clone the repository with the --recursive flag so that its submodules are fetched, then install the dependencies with pip install -r requirements.txt or use the provided environment.yml. A sample translation can then be launched with python rerender.py --cfg config/real2sculpture.json.
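For batch runs, the command above can also be driven from a small Python script. Only the --cfg flag and the sample config path come from the quick start; any additional config names would be hypothetical.

```python
# Run rerender.py over a list of config files in sequence.
import subprocess

configs = ["config/real2sculpture.json"]  # extend with further configs as needed
for cfg in configs:
    subprocess.run(["python", "rerender.py", "--cfg", cfg], check=True)
```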
Highlighted Details
Maintenance & Community
The project was accepted to SIGGRAPH Asia 2023 and has been integrated into Hugging Face Diffusers. Recent updates include loose cross-frame attention and FreeU integration.
Licensing & Compatibility
The repository is released under the MIT License, permitting commercial use and linking with closed-source projects.
Limitations & Caveats
The main hardware requirement is 24 GB of VRAM, though memory-reduction techniques are suggested. Installation on Windows may require manual setup of CUDA, Git, and Visual Studio with the Windows SDK; pre-compiled binaries for ebsynth are provided as a fallback. Path names should contain only English letters or underscores to avoid FileNotFoundError.