OmniWeaving by Tencent-Hunyuan

Unified video generation with free-form composition and reasoning

Created 1 week ago

325 stars

Top 83.9% on SourcePulse


Summary

OmniWeaving addresses unified video generation by integrating multimodal composition and reasoning capabilities. It targets researchers and practitioners seeking advanced video creation, producing sophisticated outputs from complex, interleaved text-and-image inputs by inferring nuanced user intent.

How It Works

The architecture combines a Multimodal Large Language Model (MLLM) for semantic parsing, a Variational Autoencoder (VAE) for visual tokenization, and a Multimodal Diffusion Transformer (MMDiT) for generation. Two novelties stand out: an "Activating Thinking Mode," in which the MLLM actively reasons about the user's request to refine the prompt, and "Hidden States DeepStacking," which injects multi-granular semantic guidance from several MLLM layers into the MMDiT. The project reports state-of-the-art performance among open-source unified models.
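The layer-tapping idea behind "Hidden States DeepStacking" can be sketched as follows. This is an illustrative Python sketch, not the repository's implementation; the class name, dimensions, and tap indices are all assumptions:

```python
import torch
import torch.nn as nn


class DeepStackProjector(nn.Module):
    """Illustrative sketch (not the actual OmniWeaving code): fuse hidden
    states tapped from several MLLM layers into a single conditioning
    signal for a diffusion transformer. Names and dims are assumptions."""

    def __init__(self, mllm_dim: int, dit_dim: int, tap_layers: list[int]):
        super().__init__()
        self.tap_layers = tap_layers
        # One linear projection per tapped MLLM layer.
        self.projs = nn.ModuleList(
            nn.Linear(mllm_dim, dit_dim) for _ in tap_layers
        )

    def forward(self, hidden_states: list[torch.Tensor]) -> torch.Tensor:
        # Project each tapped layer's hidden states, then sum them so
        # shallow and deep layers both guide the MMDiT.
        feats = [
            proj(hidden_states[i])
            for proj, i in zip(self.projs, self.tap_layers)
        ]
        return torch.stack(feats).sum(dim=0)
```

Summing projections from both shallow and deep layers lets the generator condition on low-level token detail and high-level intent at once, which is the stated point of multi-granular guidance.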

Quick Start & Requirements

Installation involves cloning the repository and installing dependencies via pip install -r requirements.txt. Optional acceleration libraries like Flash Attention, Flex-Block-Attention, or SageAttention can be installed for performance gains. Model weights are available on HuggingFace. Training data construction requires a VLM server (e.g., Qwen3-VL-235B). Inference examples suggest multi-GPU setups are beneficial.
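As a small companion to the optional-acceleration step, here is a hedged Python check for which backends are importable in the current environment; the module names are assumptions, not taken from the repository:

```python
import importlib.util

# Hypothetical helper (not part of the repo): map assumed module names
# of the optional acceleration libraries to human-readable labels.
OPTIONAL_BACKENDS = {
    "flash_attn": "Flash Attention",
    "sageattention": "SageAttention",
}


def available_backends() -> list[str]:
    """Return labels of the optional backends that are importable."""
    return [
        name
        for module, name in OPTIONAL_BACKENDS.items()
        if importlib.util.find_spec(module) is not None
    ]
```

Running the check before inference makes it easy to log which accelerators are active and fall back to stock PyTorch attention when none are installed.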

Highlighted Details

Built on HunyuanVideo-1.5, OmniWeaving introduces IntelligentVBench for evaluating unified video generation. It supports diverse tasks including Text-to-Video (T2V), Image-to-Video (I2V), video editing, and compositional generation with multiple subjects and modalities.

Maintenance & Community

Developed by Tencent's HunyuanVideo team. Acknowledges contributions from key open-source projects like Transformers and Diffusers. No direct community channels (Discord/Slack) or roadmap links are provided in the README.

Licensing & Compatibility

The repository's license is not explicitly stated in the README, posing a significant adoption barrier. Compatibility for commercial use or closed-source linking is undetermined.

Limitations & Caveats

Training data construction pipelines are provided for representative tasks, but some may be simplified or omit components. The requirement for a VLM server for data preparation is a notable setup hurdle. The absence of a clear license is a critical limitation for deployment.

Health Check

Last Commit: 1 day ago
Responsiveness: Inactive
Pull Requests (30d): 0
Issues (30d): 6

Star History

351 stars in the last 11 days
