Tora by Alibaba

Research code for trajectory-oriented video generation using diffusion transformers

Created 9 months ago
1,193 stars

Top 33.5% on SourcePulse

Project Summary

Tora is a framework for trajectory-oriented diffusion transformer-based video generation, enabling concurrent control over textual, visual, and motion conditions. It targets researchers and developers in AI video generation seeking precise control over video dynamics and physical movement simulation.

How It Works

Tora pairs a Trajectory Extractor (TE) and a Motion-guidance Fuser (MGF) with a Diffusion Transformer (DiT) backbone. The TE encodes arbitrary trajectories into hierarchical spacetime motion patches using a 3D video compression network; the MGF then fuses these motion patches into the DiT blocks, so that generated videos follow the specified trajectories while retaining control over duration, aspect ratio, and resolution.
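
The source describes this fusion only at a high level. Below is a minimal PyTorch-style sketch of the general idea, assuming an adaptive-normalization-style fusion in which motion patches modulate a DiT block's hidden states; the class name, dimensions, and fusion choice are illustrative assumptions, not Tora's actual implementation.

```python
import torch
import torch.nn as nn


class MotionGuidanceFuser(nn.Module):
    """Illustrative fuser: injects motion-patch features into a DiT block's
    hidden states via adaptive normalization (hypothetical, not Tora's code)."""

    def __init__(self, hidden_dim: int, motion_dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_dim, elementwise_affine=False)
        # Predict a per-token scale and shift from the motion patches.
        self.to_scale_shift = nn.Linear(motion_dim, 2 * hidden_dim)

    def forward(self, hidden: torch.Tensor, motion: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, tokens, hidden_dim); motion: (batch, tokens, motion_dim)
        scale, shift = self.to_scale_shift(motion).chunk(2, dim=-1)
        return self.norm(hidden) * (1 + scale) + shift


# Toy usage: modulate 16 spacetime tokens of one DiT block.
fuser = MotionGuidanceFuser(hidden_dim=1152, motion_dim=128)
hidden_states = torch.randn(1, 16, 1152)
motion_patches = torch.randn(1, 16, 128)  # would come from the Trajectory Extractor
print(fuser(hidden_states, motion_patches).shape)  # torch.Size([1, 16, 1152])
```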

Quick Start & Requirements

  • Installation: Clone the repository, set up a Python 3.10-3.12 environment with PyTorch 2.4.0 and CUDA 12.1, then install dependencies via pip install -e . within the modules/SwissArmyTransformer directory and pip install -r requirements.txt in the sat directory.
  • Prerequisites: Python 3.10-3.12, PyTorch 2.4.0, CUDA 12.1 (a sanity-check sketch follows this list).
  • Model Weights: Download required weights (VAE, T5, Tora) and place them in Tora/sat/ckpts. Note that Tora weights require adherence to the CogVideoX License.
  • Resources: Inference requires ~30 GiB of VRAM (A100); training requires ~60 GiB (A100).
  • Demos & Docs: ModelScope Demo, CogVideoX Documentation.

Highlighted Details

  • Supports Text-to-Video and Image-to-Video generation.
  • Achieves ~52% speedup per inference step with SageAttention2 and model compilation (tested on A10).
  • Reduces inference VRAM requirements to ~5 GiB in the diffusers version.
  • Offers training code for Text-to-Video.

Maintenance & Community

The project is actively updated, with recent releases including Image-to-Video functionality, diffusers integration, and training code. It acknowledges contributions from CogVideo, Open-Sora, and MotionCtrl.

Licensing & Compatibility

Model weights require adherence to the CogVideoX License. A code license is not explicitly stated in the README, but the project's reliance on CogVideoX suggests potential commercial-use restrictions.

Limitations & Caveats

The initial (CogVideoX-based) release is intended for academic research only, and the authors indicate that commercial plans may limit full open-sourcing. For best results, the authors recommend enhancing text prompts with GPT-4.

Health Check

  • Last commit: 3 weeks ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 9
  • Star History: 63 stars in the last 90 days
