Tuning-free method for multi-prompt video generation using a diffusion transformer
DiTCtrl offers a tuning-free method for generating coherent longer videos from multiple sequential text prompts using the MM-DiT architecture. It targets researchers and practitioners in AI video generation seeking to improve prompt adherence and temporal consistency without extensive retraining. The primary benefit is achieving smooth transitions and consistent object motion across complex, multi-part narratives.
How It Works
DiTCtrl leverages the attention mechanisms within the MM-DiT architecture, treating multi-prompt video generation as temporal video editing. By analyzing and reweighting specific attention maps between text and video tokens, it enables precise semantic control. This approach allows for smooth transitions and consistent object motion across sequential prompts without requiring additional training, effectively enabling prompt-to-prompt style editing capabilities within the diffusion model.
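The idea of reweighting text-to-video attention can be illustrated with a minimal sketch, assuming an MM-DiT-style joint attention over concatenated text and video tokens; the function name, token layout, and the reweight vector below are illustrative assumptions, not the repository's API.

```python
# Minimal sketch (not the official DiTCtrl code): reweighting the attention
# that video tokens pay to individual text tokens inside a joint attention
# block. Shapes and the `reweight` vector are illustrative assumptions.
import torch

def joint_attention_with_reweight(q, k, v, num_text_tokens, reweight=None):
    """q, k, v: (batch, heads, text_tokens + video_tokens, head_dim).
    reweight: optional (num_text_tokens,) scale that boosts or damps how
    strongly video tokens attend to each text token."""
    scale = q.shape[-1] ** -0.5
    attn = (q @ k.transpose(-2, -1)) * scale   # (B, H, N, N) attention logits
    attn = attn.softmax(dim=-1)

    if reweight is not None:
        # Scale the video->text attention columns, then renormalize each row.
        attn[..., num_text_tokens:, :num_text_tokens] *= reweight
        attn = attn / attn.sum(dim=-1, keepdim=True)

    return attn @ v

# Toy usage: 8 text tokens + 64 video tokens, emphasizing the 3rd text token.
B, H, T, V, D = 1, 4, 8, 64, 32
q = torch.randn(B, H, T + V, D)
k = torch.randn(B, H, T + V, D)
v = torch.randn(B, H, T + V, D)
w = torch.ones(T)
w[2] = 1.5
out = joint_attention_with_reweight(q, k, v, num_text_tokens=T, reweight=w)
```

In this toy setup, scaling a single text-token column shifts how much the video tokens attend to that prompt term, which is the kind of semantic control the paragraph above describes.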
Quick Start & Requirements
Dependencies include xformers (0.0.28.post1). After setup, multi-prompt generation is launched with bash run_multi_prompt.sh.
Highlighted Details
Maintenance & Community
The project is associated with researchers from The Chinese University of Hong Kong and Tencent ARC Lab. The codebase builds upon CogVideoX, MasaCtrl, MimicMotion, FreeNoise, and prompt-to-prompt.
Licensing & Compatibility
Released under a custom license (see the LICENSE file); commercial use or closed-source linking would require reviewing its terms.
Limitations & Caveats
The setup requires specific PyTorch and CUDA versions and downloading large model weights. The project is built on the CogVideoX architecture, and a Diffusers version is listed as a future task.