Tora by Alibaba

Research framework for trajectory-oriented video generation using diffusion transformers

Created 11 months ago
1,199 stars

Top 32.6% on SourcePulse

View on GitHub
Project Summary

Tora is a framework for trajectory-oriented diffusion transformer-based video generation, enabling concurrent control over textual, visual, and motion conditions. It targets researchers and developers in AI video generation seeking precise control over video dynamics and physical movement simulation.

How It Works

Tora integrates a Trajectory Extractor (TE) and a Motion-guidance Fuser (MGF) with a Diffusion Transformer (DiT) architecture. The TE encodes arbitrary trajectories into hierarchical spacetime motion patches using a 3D video compression network. The MGF then fuses these motion patches into DiT blocks, facilitating the generation of videos that adhere to specified trajectories, offering control over duration, aspect ratio, and resolution.
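
To make the data flow concrete, below is a minimal PyTorch sketch of an adaptive-normalization fusion step of the kind the MGF performs: motion patches produce per-token scale and shift parameters that modulate the DiT's latent tokens. All module names, shapes, and the exact fusion form are illustrative assumptions, not Tora's actual code.

```python
# Hypothetical sketch of an MGF-style fusion step: motion patches from the
# Trajectory Extractor modulate a DiT block via adaptive normalization.
# Names and shapes are illustrative assumptions, not Tora's implementation.
import torch
import torch.nn as nn

class MotionGuidanceFuser(nn.Module):
    def __init__(self, hidden_dim: int, motion_dim: int):
        super().__init__()
        # Project motion patches to per-token scale/shift parameters.
        self.to_scale_shift = nn.Linear(motion_dim, 2 * hidden_dim)
        # Zero-init so the fuser starts as an identity over a pretrained DiT
        # (a common conditioning trick, assumed here).
        nn.init.zeros_(self.to_scale_shift.weight)
        nn.init.zeros_(self.to_scale_shift.bias)
        self.norm = nn.LayerNorm(hidden_dim, elementwise_affine=False)

    def forward(self, x: torch.Tensor, motion: torch.Tensor) -> torch.Tensor:
        # x:      (batch, tokens, hidden_dim)  video latent tokens in a DiT block
        # motion: (batch, tokens, motion_dim)  spacetime motion patches from the TE
        scale, shift = self.to_scale_shift(motion).chunk(2, dim=-1)
        return self.norm(x) * (1 + scale) + shift  # trajectory-conditioned tokens

# Toy usage: 16 latent tokens with 64-dim features, 32-dim motion patches.
fuser = MotionGuidanceFuser(hidden_dim=64, motion_dim=32)
out = fuser(torch.randn(2, 16, 64), torch.randn(2, 16, 32))
print(out.shape)  # torch.Size([2, 16, 64])
```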

Quick Start & Requirements

  • Installation: Clone the repository, create a Python 3.10-3.12 environment with PyTorch 2.4.0 and CUDA 12.1, then install dependencies: run pip install -e . inside the modules/SwissArmyTransformer directory and pip install -r requirements.txt inside the sat directory (consolidated into a command sketch after this list).
  • Prerequisites: Python 3.10-3.12, PyTorch 2.4.0, CUDA 12.1.
  • Model Weights: Download required weights (VAE, T5, Tora) and place them in Tora/sat/ckpts. Note that Tora weights require adherence to the CogVideoX License.
  • Resources: Inference requires ~30 GiB of VRAM and training ~60 GiB, both measured on an A100.
  • Demos & Docs: ModelScope Demo, CogVideoX Documentation.
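
Consolidating the steps above, a sketch of the install sequence; the repository URL, environment tooling, and exact paths are assumptions based on this summary and the upstream README:

```bash
# Sketch of the installation steps described above; verify paths against
# the upstream README before running.
git clone https://github.com/alibaba/Tora.git
cd Tora
conda create -n tora python=3.10 -y && conda activate tora
pip install torch==2.4.0 --index-url https://download.pytorch.org/whl/cu121
pip install -e ./modules/SwissArmyTransformer
pip install -r sat/requirements.txt
# Model weights (VAE, T5, Tora) go in Tora/sat/ckpts per the notes above.
```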

Highlighted Details

  • Supports Text-to-Video and Image-to-Video generation.
  • Achieves a ~52% per-step inference speedup with SageAttention2 plus model compilation (tested on an A10; see the sketch after this list).
  • Reduced inference VRAM requirements to ~5 GiB in the diffusers version.
  • Offers training code for Text-to-Video.
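
For the speedup bullet above, here is a minimal sketch of the general recipe: route PyTorch's scaled-dot-product attention through the sageattention package and compile the transformer. How Tora's code actually wires these in is not shown here, so treat the integration points as assumptions.

```python
# Sketch of the speedup recipe: SageAttention2 for attention, torch.compile
# for the rest. sageattn is the public API of the sageattention package;
# the monkey-patch below is an illustrative assumption, not Tora's code.
import torch
import torch.nn.functional as F
from sageattention import sageattn

_orig_sdpa = F.scaled_dot_product_attention

def sdpa_via_sage(q, k, v, attn_mask=None, dropout_p=0.0, is_causal=False, **kw):
    # Fall back to the stock kernel for calls SageAttention can't express.
    if attn_mask is not None or dropout_p > 0.0:
        return _orig_sdpa(q, k, v, attn_mask=attn_mask,
                          dropout_p=dropout_p, is_causal=is_causal, **kw)
    # (batch, heads, seq, head_dim) layout, i.e. tensor_layout="HND".
    return sageattn(q, k, v, tensor_layout="HND", is_causal=is_causal)

F.scaled_dot_product_attention = sdpa_via_sage

# transformer = ...  # the DiT backbone (hypothetical handle)
# transformer = torch.compile(transformer, mode="max-autotune")
```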

Maintenance & Community

The project is actively updated, with recent releases including Image-to-Video functionality, diffusers integration, and training code. It acknowledges contributions from CogVideo, Open-Sora, and MotionCtrl.

Licensing & Compatibility

Model weights require adherence to the CogVideoX License. The project's licensing for code is not explicitly stated in the README, but its reliance on CogVideoX suggests potential commercial use restrictions.

Limitations & Caveats

The initial release (the CogVideoX version) is intended for academic research only, and the authors note that commercial plans may prevent a fully open-source release. They also recommend enhancing text prompts with GPT-4 for best results.

Health Check

  • Last commit: 2 months ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 6 stars in the last 30 days

Explore Similar Projects

Starred by Alex Yu (Research Scientist at OpenAI; former cofounder of Luma AI), Jiaming Song (Chief Scientist at Luma AI), and 1 more.

SkyReels-V2 by SkyworkAI

  • Film generation model for infinite-length videos using diffusion forcing
  • 4k stars, top 3.3% on SourcePulse
  • Created 5 months ago, updated 1 month ago