LongVie by Vchitect

Multimodal world model for ultra-long video generation

Created 5 months ago
297 stars

Top 89.4% on SourcePulse

View on GitHub
Project Summary

Summary

LongVie addresses the challenge of generating ultra-long, controllable videos by introducing a multimodal world model. It targets researchers and developers in AI video generation, offering precise control over video output through depth-map and trajectory signals, enabling more complex and coherent long-form content creation.

How It Works

LongVie presents a multimodal controllable world model engineered for synthesizing ultra-long video sequences. Its core innovation is the ability to integrate and respond to explicit control signals, specifically depth maps and pointmaps (trajectory representations), during the generation process. This allows fine-grained manipulation and coherence over extended durations, moving beyond standard text-conditioned generation to a more structured, controllable paradigm for complex visual narratives.
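As a rough sketch of this idea, not LongVie's actual implementation, the code below shows one way a dense (depth) and a sparse (pointmap) control stream could be encoded and injected into a video denoiser as additive latent residuals. Every class name, shape, and parameter here is hypothetical.

    import torch
    import torch.nn as nn

    class ControlEncoder(nn.Module):
        """Maps a per-frame control signal into the denoiser's latent space."""
        def __init__(self, in_channels: int, latent_channels: int):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv3d(in_channels, 64, kernel_size=3, padding=1),
                nn.SiLU(),
                nn.Conv3d(64, latent_channels, kernel_size=3, padding=1),
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return self.net(x)

    class ControlledDenoiser(nn.Module):
        """Wraps any video denoiser with additive depth/trajectory conditioning."""
        def __init__(self, denoiser, latent_channels: int = 16):
            super().__init__()
            self.denoiser = denoiser
            self.depth_enc = ControlEncoder(1, latent_channels)  # dense signal
            self.traj_enc = ControlEncoder(3, latent_channels)   # sparse signal

        def forward(self, latents, t, depth, pointmap, text_emb):
            # Both control streams are fused as residuals on the noisy latents.
            cond = self.depth_enc(depth) + self.traj_enc(pointmap)
            return self.denoiser(latents + cond, t, text_emb)

    # One denoising call with a trivial stand-in backbone; tensors are (B, C, T, H, W).
    model = ControlledDenoiser(denoiser=lambda z, t, c: z)
    latents = torch.randn(1, 16, 8, 32, 32)
    depth = torch.randn(1, 1, 8, 32, 32)     # dense depth-map control
    pointmap = torch.randn(1, 3, 8, 32, 32)  # sparse trajectory control
    out = model(latents, torch.tensor([0.5]), depth, pointmap, None)

Additive residual injection is only one plausible fusion strategy; the actual model may use cross-attention or a ControlNet-style branch instead.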

Quick Start & Requirements

  • Installation: Requires Python 3.10, PyTorch 2.5.1 (built for CUDA 12.1), psutil, ninja, and flash-attention v2.7.2.post1. Setup involves creating a Conda environment, activating it, installing the dependencies, and then installing the package in editable mode (pip install -e .); see the sketch after this list.
  • Prerequisites: CUDA 12.1 is required by the specified PyTorch build.
  • Resource Footprint: Inference for a 5-second video clip takes approximately 8-9 minutes on a single A100 GPU.
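A minimal sketch of that flow; the environment name and repository URL are assumptions, and the version pins are taken from the summary above, not copied from the README verbatim:

    conda create -n longvie python=3.10 -y
    conda activate longvie
    git clone https://github.com/Vchitect/LongVie.git   # assumed repo URL
    cd LongVie
    pip install torch==2.5.1 --index-url https://download.pytorch.org/whl/cu121
    pip install psutil ninja
    pip install flash-attn==2.7.2.post1 --no-build-isolation
    pip install -e .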

Highlighted Details

  • Specializes in generating "ultra-long" video sequences, addressing a key challenge in current AI video models.
  • Offers robust controllability via depth maps and pointmap (trajectory) signals, enabling precise scene and motion guidance.
  • Includes dedicated utilities for extracting and processing these control signals from existing media or generated content (an illustrative sketch follows this list).
  • Builds on the Wan2.1-I2V-14B-480P base model, a large-scale (14B-parameter) image-to-video foundation.
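The repository's own extraction utilities are not shown in this summary. As an illustration only, per-frame depth maps for the dense control stream could be produced with an off-the-shelf estimator such as MiDaS, loaded via torch.hub:

    import cv2
    import torch

    # Illustration only: LongVie ships its own control-signal utilities;
    # this uses the public MiDaS depth estimator from torch.hub instead.
    midas = torch.hub.load("intel-isl/MiDaS", "DPT_Large")
    midas.eval()
    transforms = torch.hub.load("intel-isl/MiDaS", "transforms")

    img = cv2.cvtColor(cv2.imread("frame.png"), cv2.COLOR_BGR2RGB)  # hypothetical input frame
    batch = transforms.dpt_transform(img)

    with torch.no_grad():
        pred = midas(batch)
        # Resize the relative-depth prediction back to the frame resolution.
        depth = torch.nn.functional.interpolate(
            pred.unsqueeze(1), size=img.shape[:2], mode="bicubic",
            align_corners=False,
        ).squeeze()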

Maintenance & Community

No specific details on contributors, sponsorships, community channels (Discord/Slack), or roadmap are provided in the README.

Licensing & Compatibility

The README does not explicitly state the project's license or provide compatibility notes for commercial use.

Limitations & Caveats

Inference is computationally intensive, with a 5-second clip requiring significant GPU time (8-9 minutes on an A100). The project's status (e.g., alpha, beta) is not specified.

Health Check

  • Last Commit: 4 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 1
  • Issues (30d): 10
  • Star History: 213 stars in the last 30 days
