LongVie by Vchitect

Multimodal world model for ultra-long video generation

Created 5 months ago
297 stars

Top 89.4% on SourcePulse

View on GitHub
Project Summary

Summary

LongVie addresses the challenge of generating ultra-long, controllable videos by introducing a multimodal world model. It targets researchers and developers in AI video generation, offering precise control over video output through depth-map and trajectory signals, enabling more complex and coherent long-form content creation.

How It Works

LongVie presents a multimodal controllable world model engineered for synthesizing ultra-long video sequences. Its core innovation is the ability to integrate and respond to explicit control signals, specifically depth maps and pointmaps (trajectory representations), during the generation process. This allows fine-grained manipulation and coherence over extended durations, moving beyond standard text-conditioned generation to a more structured, controllable paradigm for complex visual narratives.
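As a rough sketch of this idea, not LongVie's actual implementation, the code below shows one way a dense (depth) and a sparse (pointmap) control stream could be encoded and injected into a video denoiser as additive latent residuals. Every class name, shape, and parameter here is hypothetical.

    import torch
    import torch.nn as nn

    class ControlEncoder(nn.Module):
        """Maps a per-frame control signal into the denoiser's latent space."""
        def __init__(self, in_channels: int, latent_channels: int):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv3d(in_channels, 64, kernel_size=3, padding=1),
                nn.SiLU(),
                nn.Conv3d(64, latent_channels, kernel_size=3, padding=1),
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return self.net(x)

    class ControlledDenoiser(nn.Module):
        """Wraps any video denoiser with additive depth/trajectory conditioning."""
        def __init__(self, denoiser, latent_channels: int = 16):
            super().__init__()
            self.denoiser = denoiser
            self.depth_enc = ControlEncoder(1, latent_channels)  # dense signal
            self.traj_enc = ControlEncoder(3, latent_channels)   # sparse signal

        def forward(self, latents, t, depth, pointmap, text_emb):
            # Both control streams are fused as residuals on the noisy latents.
            cond = self.depth_enc(depth) + self.traj_enc(pointmap)
            return self.denoiser(latents + cond, t, text_emb)

    # One denoising call with a trivial stand-in backbone; tensors are (B, C, T, H, W).
    model = ControlledDenoiser(denoiser=lambda z, t, c: z)
    latents = torch.randn(1, 16, 8, 32, 32)
    depth = torch.randn(1, 1, 8, 32, 32)     # dense depth-map control
    pointmap = torch.randn(1, 3, 8, 32, 32)  # sparse trajectory control
    out = model(latents, torch.tensor([0.5]), depth, pointmap, None)

Additive residual injection is only one plausible fusion strategy; the actual model may use cross-attention or a ControlNet-style branch instead.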

Quick Start & Requirements

  • Installation: Requires Python 3.10, PyTorch 2.5.1 (built for CUDA 12.1), psutil, ninja, and flash-attention v2.7.2.post1. Setup involves creating a Conda environment, activating it, installing the dependencies, and then installing the package in editable mode (pip install -e .); see the sketch after this list.
  • Prerequisites: CUDA 12.1 is required by the specified PyTorch build.
  • Resource Footprint: Inference for a 5-second video clip takes approximately 8-9 minutes on a single A100 GPU.
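A minimal sketch of that flow; the environment name and repository URL are assumptions, and the version pins are taken from the summary above, not copied from the README verbatim:

    conda create -n longvie python=3.10 -y
    conda activate longvie
    git clone https://github.com/Vchitect/LongVie.git   # assumed repo URL
    cd LongVie
    pip install torch==2.5.1 --index-url https://download.pytorch.org/whl/cu121
    pip install psutil ninja
    pip install flash-attn==2.7.2.post1 --no-build-isolation
    pip install -e .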

Highlighted Details

  • Specializes in generating "ultra-long" video sequences, addressing a key challenge in current AI video models.
  • Offers robust controllability via depth maps and pointmap (trajectory) signals, enabling precise scene and motion guidance.
  • Includes dedicated utilities for extracting and processing these control signals from existing media or generated content (an illustrative sketch follows this list).
  • Builds on the Wan2.1-I2V-14B-480P base model, a large-scale (14B-parameter) image-to-video foundation.
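The repository's own extraction utilities are not shown in this summary. As an illustration only, per-frame depth maps for the dense control stream could be produced with an off-the-shelf estimator such as MiDaS, loaded via torch.hub:

    import cv2
    import torch

    # Illustration only: LongVie ships its own control-signal utilities;
    # this uses the public MiDaS depth estimator from torch.hub instead.
    midas = torch.hub.load("intel-isl/MiDaS", "DPT_Large")
    midas.eval()
    transforms = torch.hub.load("intel-isl/MiDaS", "transforms")

    img = cv2.cvtColor(cv2.imread("frame.png"), cv2.COLOR_BGR2RGB)  # hypothetical input frame
    batch = transforms.dpt_transform(img)

    with torch.no_grad():
        pred = midas(batch)
        # Resize the relative-depth prediction back to the frame resolution.
        depth = torch.nn.functional.interpolate(
            pred.unsqueeze(1), size=img.shape[:2], mode="bicubic",
            align_corners=False,
        ).squeeze()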

Maintenance & Community

No specific details on contributors, sponsorships, community channels (Discord/Slack), or roadmap are provided in the README.

Licensing & Compatibility

The README does not explicitly state the project's license or provide compatibility notes for commercial use.

Limitations & Caveats

Inference is computationally intensive, with a 5-second clip requiring significant GPU time (8-9 minutes on an A100). The project's status (e.g., alpha, beta) is not specified.

Health Check

  • Last Commit: 4 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 1
  • Issues (30d): 10
  • Star History: 213 stars in the last 30 days
