Tuning-free method for multi-prompt video generation using a diffusion transformer
DiTCtrl offers a tuning-free method for generating coherent longer videos from multiple sequential text prompts using the MM-DiT architecture. It targets researchers and practitioners in AI video generation seeking to improve prompt adherence and temporal consistency without extensive retraining. The primary benefit is achieving smooth transitions and consistent object motion across complex, multi-part narratives.
How It Works
DiTCtrl leverages the attention mechanisms within the MM-DiT architecture, treating multi-prompt video generation as temporal video editing. By analyzing and reweighting specific attention maps between text and video tokens, it enables precise semantic control. This approach allows for smooth transitions and consistent object motion across sequential prompts without requiring additional training, effectively enabling prompt-to-prompt style editing capabilities within the diffusion model.
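The idea of reweighting text-to-video attention can be illustrated with a minimal sketch, assuming an MM-DiT-style joint attention over concatenated text and video tokens; the function name, token layout, and the reweight vector below are illustrative assumptions, not the repository's API.

```python
# Minimal sketch (not the official DiTCtrl code): reweighting the attention
# that video tokens pay to individual text tokens inside a joint attention
# block. Shapes and the `reweight` vector are illustrative assumptions.
import torch

def joint_attention_with_reweight(q, k, v, num_text_tokens, reweight=None):
    """q, k, v: (batch, heads, text_tokens + video_tokens, head_dim).
    reweight: optional (num_text_tokens,) scale that boosts or damps how
    strongly video tokens attend to each text token."""
    scale = q.shape[-1] ** -0.5
    attn = (q @ k.transpose(-2, -1)) * scale   # (B, H, N, N) attention logits
    attn = attn.softmax(dim=-1)

    if reweight is not None:
        # Scale the video->text attention columns, then renormalize each row.
        attn[..., num_text_tokens:, :num_text_tokens] *= reweight
        attn = attn / attn.sum(dim=-1, keepdim=True)

    return attn @ v

# Toy usage: 8 text tokens + 64 video tokens, emphasizing the 3rd text token.
B, H, T, V, D = 1, 4, 8, 64, 32
q = torch.randn(B, H, T + V, D)
k = torch.randn(B, H, T + V, D)
v = torch.randn(B, H, T + V, D)
w = torch.ones(T)
w[2] = 1.5
out = joint_attention_with_reweight(q, k, v, num_text_tokens=T, reweight=w)
```

In this toy setup, scaling a single text-token column shifts how much the video tokens attend to that prompt term, which is the kind of semantic control the paragraph above describes.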
Quick Start & Requirements
Dependencies include xformers (0.0.28.post1). After setup, multi-prompt generation is launched with bash run_multi_prompt.sh.
Highlighted Details
Maintenance & Community
The project is associated with researchers from The Chinese University of Hong Kong and Tencent ARC Lab. The codebase builds upon CogVideoX, MasaCtrl, MimicMotion, FreeNoise, and prompt-to-prompt.
Licensing & Compatibility
Released under a custom license (see the LICENSE file); commercial use or closed-source linking would require reviewing its terms.
Limitations & Caveats
The setup requires specific PyTorch and CUDA versions and downloading large model weights. The project is built on the CogVideoX architecture, and a Diffusers version is listed as a future task.