MOVA by OpenMOSS

Scalable foundation model for synchronized video-audio generation

Created 4 weeks ago

713 stars

Top 48.1% on SourcePulse

View on GitHub
Project Summary

MOVA is a foundation model addressing the challenge of synchronized video and audio generation, aiming to break the "silent era" of open-source video synthesis. It targets researchers and developers seeking high-fidelity, perfectly aligned video and audio outputs, offering a significant benefit over cascaded pipelines by generating both modalities simultaneously in a single inference pass.

How It Works

MOVA employs an Asymmetric Dual-Tower Architecture, leveraging pre-trained video and audio models. These towers are fused via a bidirectional cross-attention mechanism, enabling rich modality interaction. This native bimodal generation approach avoids error accumulation inherent in cascaded systems and achieves precise lip-sync and environment-aware sound effects.
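The fusion step can be sketched in plain NumPy. This is a minimal, single-head illustration of bidirectional cross-attention between two token streams with a residual connection, not MOVA's actual implementation; all names, shapes, and dimensions here are invented for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(q_tokens, kv_tokens):
    # Single-head cross-attention: queries come from one tower,
    # keys/values from the other (here keys and values are the raw tokens).
    d = q_tokens.shape[-1]
    scores = q_tokens @ kv_tokens.T / np.sqrt(d)   # (n_q, n_kv)
    return softmax(scores) @ kv_tokens             # (n_q, d)

# Toy token streams: 16 video tokens and 24 audio tokens, dim 32.
rng = np.random.default_rng(0)
video = rng.standard_normal((16, 32))
audio = rng.standard_normal((24, 32))

# Bidirectional fusion: each tower attends to the other, then adds the
# result back to its own stream via a residual connection.
video_fused = video + cross_attend(video, audio)
audio_fused = audio + cross_attend(audio, video)
```

Each tower keeps its own token count and dimensionality; only the attention exchange couples the two streams, which is what lets pre-trained unimodal backbones be fused without reshaping either one.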

Quick Start & Requirements

  • Installation: Set up a Python 3.13 environment (conda create -n mova python=3.13 -y, conda activate mova) and install the package (pip install -e .). Training requires pip install -e ".[train]".
  • Prerequisites: Python 3.13, Hugging Face Hub for model downloads. Inference and training require significant VRAM and Host RAM, detailed in performance tables below. Ascend NPU support is available.
  • Resource Footprint:
    • Inference (360p, 8s video): component-wise offload needs ~48GB VRAM and ~67GB Host RAM (RTX 4090) or ~9GB VRAM and ~67GB Host RAM (H100); layerwise offload needs ~12GB VRAM and ~77GB Host RAM (RTX 4090/H100).
    • Training (360p, 8s video): low-resource LoRA (single GPU) needs ~18GB VRAM and ~80GB Host RAM (RTX 4090); Accelerate + FSDP LoRA (8 GPUs) needs ~50GB VRAM per GPU and ≥128GB Host RAM (H100).
  • Links: Demos are available via links in the README. Model weights are downloadable from Hugging Face (e.g., hf download OpenMOSS-Team/MOVA-360p).
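The weight download can also be done programmatically rather than with the `hf` CLI. A minimal sketch assuming the `huggingface_hub` Python package is installed; the helper name `fetch_mova_weights` and the `local_dir` default are hypothetical, and only the repo id `OpenMOSS-Team/MOVA-360p` comes from the README.

```python
# Hypothetical helper (not part of the MOVA codebase): mirror the
# released MOVA-360p checkpoints locally via the huggingface_hub API
# instead of the `hf download` CLI shown above.
MOVA_REPO = "OpenMOSS-Team/MOVA-360p"  # repo id from the README

def fetch_mova_weights(local_dir="./checkpoints/MOVA-360p"):
    # snapshot_download copies the whole model repo into local_dir
    # and returns the path to the downloaded snapshot.
    from huggingface_hub import snapshot_download
    return snapshot_download(repo_id=MOVA_REPO, local_dir=local_dir)
```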

Highlighted Details

  • Native bimodal generation of video and synchronized audio in a single pass.
  • State-of-the-art performance in multilingual lip-synchronization and environment-aware sound effects, outperforming existing open-source models on Verse-Bench metrics.
  • Fully open-source release including model weights, inference code, training pipelines, and LoRA fine-tuning scripts.
  • Asymmetric Dual-Tower Architecture with bidirectional cross-attention for modality fusion.
  • Support for Ascend NPUs and integration with SGLang for inference.

Maintenance & Community

The project was released on January 29, 2026. Key features like checkpoints, multi-GPU inference, LoRA fine-tuning, NPU support, and SGLang integration are complete. Pending items include a Technical Report, Generation Workflow, and Diffusers Integration. Acknowledgements list contributions from several other open-source projects. No direct community links (Discord/Slack) or social handles are provided.

Licensing & Compatibility

The license type is not explicitly stated in the provided README. This lack of explicit licensing information poses a significant adoption blocker, particularly for commercial use or integration into closed-source projects.

Limitations & Caveats

Training 8-second, 360p videos on consumer hardware like an RTX 4090 is not recommended due to high resource requirements and slow training speeds; reducing resolution or frame count is suggested. A technical report and Diffusers integration are still pending. The absence of a clearly defined license is a critical caveat for adoption.

Health Check

  • Last Commit: 6 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 19
  • Issues (30d): 20
  • Star History: 732 stars in the last 28 days

Explore Similar Projects

Starred by Patrick von Platen (author of Hugging Face Diffusers; research engineer at Mistral) and Omar Sanseviero (DevRel at Google DeepMind).

AudioLDM by haoheliu

  • Audio generation research paper using latent diffusion
  • Top 0.1% on SourcePulse · 3k stars
  • Created 3 years ago · Updated 8 months ago