DiTCtrl by TencentARC

Tuning-free method for multi-prompt video generation using a diffusion transformer

Created 7 months ago · 285 stars · Top 92.8% on sourcepulse

Project Summary

DiTCtrl offers a tuning-free method for generating coherent longer videos from multiple sequential text prompts using the MM-DiT architecture. It targets researchers and practitioners in AI video generation seeking to improve prompt adherence and temporal consistency without extensive retraining. The primary benefit is achieving smooth transitions and consistent object motion across complex, multi-part narratives.

How It Works

DiTCtrl leverages the attention mechanisms within the MM-DiT architecture, treating multi-prompt video generation as temporal video editing. By analyzing and reweighting specific attention maps between text and video tokens, it enables precise semantic control. This approach allows for smooth transitions and consistent object motion across sequential prompts without requiring additional training, effectively enabling prompt-to-prompt style editing capabilities within the diffusion model.
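To make the mechanism concrete, below is a minimal PyTorch sketch of per-token attention reweighting in a single cross-attention step. The names here (reweighted_cross_attention, token_weights) are illustrative assumptions, not DiTCtrl's API; the actual method manipulates attention inside MM-DiT's fused text-video attention blocks across denoising steps.

```python
import torch

def reweighted_cross_attention(q_video, k_text, v_text, token_weights):
    """Toy single-head cross-attention with per-text-token reweighting.

    q_video:       (B, N_video, D) queries from video tokens
    k_text/v_text: (B, N_text, D) keys/values from prompt tokens
    token_weights: (N_text,) multipliers, e.g. > 1 to amplify a word
    """
    scale = q_video.shape[-1] ** -0.5
    attn = torch.softmax(q_video @ k_text.transpose(-2, -1) * scale, dim=-1)
    attn = attn * token_weights               # amplify/suppress chosen prompt tokens
    attn = attn / attn.sum(-1, keepdim=True)  # renormalize so rows still sum to 1
    return attn @ v_text

# Example: double the influence of the third prompt token on every video token.
B, Nv, Nt, D = 1, 16, 8, 64
q, k, v = torch.randn(B, Nv, D), torch.randn(B, Nt, D), torch.randn(B, Nt, D)
weights = torch.ones(Nt)
weights[2] = 2.0
out = reweighted_cross_attention(q, k, v, weights)  # (1, 16, 64)
```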

Quick Start & Requirements

  • Install: Requires Python 3.10, PyTorch 2.4.1 with CUDA 12.1 support, and xformers (0.0.28.post1).
  • Prerequisites: Download CogVideoX-2B model weights (VAE and Transformer components) and a T5 text encoder model.
  • Setup: Environment setup involves conda environment creation, package installation, and model weight arrangement.
  • Run: Execute generation via shell scripts like bash run_multi_prompt.sh.
  • Details: the base model is CogVideoX (a sanity-check sketch follows this list).
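As a quick sanity check that the base model runs in your environment, CogVideoX-2B can be exercised through the Hugging Face Diffusers pipeline. This is only an assumption-laden convenience: DiTCtrl itself is driven by its own shell scripts and local weight layout, and a Diffusers port of DiTCtrl is still listed as future work.

```python
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

# Plain CogVideoX-2B, single prompt -- not DiTCtrl's multi-prompt pipeline.
pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-2b", torch_dtype=torch.float16)
pipe.to("cuda")

frames = pipe(prompt="A panda strumming a guitar in a bamboo forest",
              num_inference_steps=50).frames[0]
export_to_video(frames, "sanity_check.mp4", fps=8)
```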

Highlighted Details

  • First tuning-free approach for multi-prompt, longer video generation built on the MM-DiT architecture.
  • Enables prompt-to-prompt-style video editing such as word swapping and attention reweighting via attention-map manipulation (see the sketch after this list).
  • Introduces MPVBench, a new benchmark for evaluating multi-prompt video generation.
  • Code released for the CVPR 2025 paper.
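The word-swap capability follows the prompt-to-prompt recipe: record cross-attention maps during a pass with the source prompt, then inject them during the early denoising steps of the edited-prompt pass so layout and motion are preserved while the swapped word changes content. A hedged sketch of that control logic follows; AttentionStore and inject_until_step are illustrative names, not classes from this repository.

```python
import torch

class AttentionStore:
    """Record cross-attention maps from a source-prompt pass, then replay
    them during an edited-prompt pass (prompt-to-prompt word swap)."""

    def __init__(self, inject_until_step=25):
        self.saved = {}                        # (step, layer) -> attention map
        self.inject_until = inject_until_step
        self.mode = "record"

    def __call__(self, attn, step, layer):
        key = (step, layer)
        if self.mode == "record":
            self.saved[key] = attn.detach()
            return attn
        # Replay: early steps reuse source maps so structure stays fixed;
        # later steps let the edited word take effect.
        if step < self.inject_until and key in self.saved:
            return self.saved[key]
        return attn

# Usage sketch: the denoiser would call store(attn, step, layer) inside
# every cross-attention block, once per pass.
store = AttentionStore(inject_until_step=25)
_ = store(torch.rand(1, 16, 8), step=0, layer=3)    # source pass: record
store.mode = "replay"
out = store(torch.rand(1, 16, 8), step=0, layer=3)  # edited pass: injected
assert torch.equal(out, store.saved[(0, 3)])
```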

Maintenance & Community

The project is associated with researchers from The Chinese University of Hong Kong and Tencent ARC Lab. The codebase builds upon CogVideoX, MasaCtrl, MimicMotion, FreeNoise, and prompt-to-prompt.

Licensing & Compatibility

Released under a custom license (see the LICENSE file). Commercial use or closed-source linking requires reviewing its specific terms.

Limitations & Caveats

Setup requires pinned versions of PyTorch and CUDA and the download of large model weights. The project is built on the CogVideoX architecture, and a Diffusers version is listed as a future task.

Health Check

  • Last commit: 4 months ago
  • Responsiveness: 1 week
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 27 stars in the last 90 days
