Tora by Alibaba

Research code for trajectory-oriented video generation using diffusion transformers

Created 9 months ago
1,193 stars

Top 33.5% on SourcePulse

Project Summary

Tora is a framework for trajectory-oriented diffusion transformer-based video generation, enabling concurrent control over textual, visual, and motion conditions. It targets researchers and developers in AI video generation seeking precise control over video dynamics and physical movement simulation.

How It Works

Tora pairs a Trajectory Extractor (TE) and a Motion-guidance Fuser (MGF) with a Diffusion Transformer (DiT) backbone. The TE encodes arbitrary trajectories into hierarchical spacetime motion patches using a 3D video compression network; the MGF then fuses these motion patches into the DiT blocks, so that generated videos follow the specified trajectories while retaining control over duration, aspect ratio, and resolution.
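
The source describes this fusion only at a high level. Below is a minimal PyTorch-style sketch of the general idea, assuming an adaptive-normalization-style fusion in which motion patches modulate a DiT block's hidden states; the class name, dimensions, and fusion choice are illustrative assumptions, not Tora's actual implementation.

```python
import torch
import torch.nn as nn


class MotionGuidanceFuser(nn.Module):
    """Illustrative fuser: injects motion-patch features into a DiT block's
    hidden states via adaptive normalization (hypothetical, not Tora's code)."""

    def __init__(self, hidden_dim: int, motion_dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_dim, elementwise_affine=False)
        # Predict a per-token scale and shift from the motion patches.
        self.to_scale_shift = nn.Linear(motion_dim, 2 * hidden_dim)

    def forward(self, hidden: torch.Tensor, motion: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, tokens, hidden_dim); motion: (batch, tokens, motion_dim)
        scale, shift = self.to_scale_shift(motion).chunk(2, dim=-1)
        return self.norm(hidden) * (1 + scale) + shift


# Toy usage: modulate 16 spacetime tokens of one DiT block.
fuser = MotionGuidanceFuser(hidden_dim=1152, motion_dim=128)
hidden_states = torch.randn(1, 16, 1152)
motion_patches = torch.randn(1, 16, 128)  # would come from the Trajectory Extractor
print(fuser(hidden_states, motion_patches).shape)  # torch.Size([1, 16, 1152])
```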

Quick Start & Requirements

  • Installation: Clone the repository, set up a Python 3.10-3.12 environment with PyTorch 2.4.0 and CUDA 12.1, then install dependencies via pip install -e . within the modules/SwissArmyTransformer directory and pip install -r requirements.txt in the sat directory.
  • Prerequisites: Python 3.10-3.12, PyTorch 2.4.0, CUDA 12.1 (a sanity-check sketch follows this list).
  • Model Weights: Download required weights (VAE, T5, Tora) and place them in Tora/sat/ckpts. Note that Tora weights require adherence to the CogVideoX License.
  • Resources: Inference requires ~30 GiB of VRAM (A100); training requires ~60 GiB (A100).
  • Demos & Docs: ModelScope Demo, CogVideoX Documentation.

Highlighted Details

  • Supports Text-to-Video and Image-to-Video generation.
  • Achieves ~52% speedup per inference step with SageAttention2 and model compilation (tested on A10).
  • Reduces inference VRAM requirements to ~5 GiB in the diffusers version.
  • Offers training code for Text-to-Video.

Maintenance & Community

The project is actively updated, with recent releases including Image-to-Video functionality, diffusers integration, and training code. It acknowledges contributions from CogVideo, Open-Sora, and MotionCtrl.

Licensing & Compatibility

Model weights require adherence to the CogVideoX License. A code license is not explicitly stated in the README, but the project's reliance on CogVideoX suggests potential commercial-use restrictions.

Limitations & Caveats

The initial (CogVideoX-based) release is intended for academic research only, and the authors indicate that commercial plans may limit full open-sourcing. For best results, the authors recommend enhancing text prompts with GPT-4.

Health Check

  • Last commit: 3 weeks ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 9
  • Star History: 63 stars in the last 90 days
