MOVA by OpenMOSS

Scalable foundation model for synchronized video-audio generation

Created 2 months ago
893 stars

Top 40.4% on SourcePulse

View on GitHub
Project Summary

MOVA is a foundation model for synchronized video and audio generation, aiming to end the "silent era" of open-source video synthesis. It targets researchers and developers who need high-fidelity, temporally aligned video and audio outputs, and its key advantage over cascaded pipelines is that it generates both modalities simultaneously in a single inference pass.

How It Works

MOVA employs an Asymmetric Dual-Tower Architecture built on pre-trained video and audio models. The two towers are fused via a bidirectional cross-attention mechanism, enabling rich interaction between the modalities. This native bimodal generation approach avoids the error accumulation inherent in cascaded systems and achieves precise lip-sync and environment-aware sound effects.
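This summary does not expose the fusion internals, so below is a minimal, hypothetical PyTorch sketch of the bidirectional cross-attention pattern described above. The class name, layer widths, pre-norm placement, and residual connections are illustrative assumptions, not MOVA's actual implementation.

```python
import torch
import torch.nn as nn

class BidirectionalCrossAttentionBlock(nn.Module):
    """One fusion block: each tower attends over the other's hidden states."""

    def __init__(self, video_dim: int, audio_dim: int, num_heads: int = 8):
        super().__init__()
        # The towers are asymmetric, so kdim/vdim let video queries attend
        # directly over audio-sized keys/values (and vice versa).
        self.video_attends_audio = nn.MultiheadAttention(
            embed_dim=video_dim, num_heads=num_heads,
            kdim=audio_dim, vdim=audio_dim, batch_first=True)
        self.audio_attends_video = nn.MultiheadAttention(
            embed_dim=audio_dim, num_heads=num_heads,
            kdim=video_dim, vdim=video_dim, batch_first=True)
        self.video_norm = nn.LayerNorm(video_dim)
        self.audio_norm = nn.LayerNorm(audio_dim)

    def forward(self, video_tokens: torch.Tensor, audio_tokens: torch.Tensor):
        # Video stream queries the audio stream...
        v_ctx, _ = self.video_attends_audio(
            self.video_norm(video_tokens), audio_tokens, audio_tokens)
        # ...and the audio stream queries the video stream, so alignment
        # information flows in both directions within a single pass.
        a_ctx, _ = self.audio_attends_video(
            self.audio_norm(audio_tokens), video_tokens, video_tokens)
        return video_tokens + v_ctx, audio_tokens + a_ctx

# Toy shapes: 128 video tokens at width 1024, 256 audio tokens at width 768.
block = BidirectionalCrossAttentionBlock(video_dim=1024, audio_dim=768)
video, audio = torch.randn(2, 128, 1024), torch.randn(2, 256, 768)
v_out, a_out = block(video, audio)
print(v_out.shape, a_out.shape)  # (2, 128, 1024) and (2, 256, 768)
```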

Quick Start & Requirements

  • Installation: Set up a Python 3.13 environment (conda create -n mova python=3.13 -y, conda activate mova) and install the package from a clone of the repository (pip install -e .). Training additionally requires pip install -e ".[train]".
  • Prerequisites: Python 3.13 and the Hugging Face Hub for model downloads. Inference and training require significant VRAM and host RAM; see the resource footprint below. Ascend NPU support is available.
  • Resource Footprint (360p, 8-second video):
    • Inference, component-wise offload: ~48GB VRAM / ~67GB host RAM (RTX 4090), or ~9GB VRAM / ~67GB host RAM (H100).
    • Inference, layerwise offload: ~12GB VRAM / ~77GB host RAM (RTX 4090 or H100).
    • Training, low-resource LoRA (single GPU): ≈18GB VRAM / ≈80GB host RAM (RTX 4090).
    • Training, Accelerate + FSDP LoRA (8 GPUs): ≈50GB VRAM per GPU / ≥128GB host RAM (H100).
  • Links: Demos are linked from the README. Model weights can be downloaded from Hugging Face (e.g., hf download OpenMOSS-Team/MOVA-360p; see the snippet after this list).
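The README points to the hf CLI for weight downloads; a programmatic equivalent using huggingface_hub (the library the CLI wraps) is sketched below. The local_dir value is an arbitrary assumption.

```python
# Fetch the MOVA-360p weights from the Hugging Face Hub, equivalent to
# running `hf download OpenMOSS-Team/MOVA-360p` on the command line.
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="OpenMOSS-Team/MOVA-360p",  # repo id from the README
    local_dir="checkpoints/MOVA-360p",  # hypothetical target directory
)
print(f"Weights available at {local_path}")
```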

Highlighted Details

  • Native bimodal generation of video and synchronized audio in a single pass.
  • State-of-the-art performance in multilingual lip-synchronization and environment-aware sound effects, outperforming existing open-source models on Verse-Bench metrics.
  • Fully open-source release including model weights, inference code, training pipelines, and LoRA fine-tuning scripts.
  • Asymmetric Dual-Tower Architecture with bidirectional cross-attention for modality fusion.
  • Support for Ascend NPUs and integration with SGLang for inference.

Maintenance & Community

The project was released on January 29, 2026. Completed milestones include checkpoints, multi-GPU inference, LoRA fine-tuning, NPU support, and SGLang integration; a technical report, a generation workflow, and Diffusers integration are still pending. The acknowledgements credit several other open-source projects. No community links (Discord/Slack) or social handles are provided.

Licensing & Compatibility

No license is stated in the README. This omission is a significant adoption blocker, particularly for commercial use or integration into closed-source projects.

Limitations & Caveats

Training on 8-second, 360p videos with consumer hardware such as an RTX 4090 is not recommended due to high resource requirements and slow training speed; reducing the resolution or frame count is suggested. A technical report and Diffusers integration are still pending. The absence of a clearly defined license is a critical caveat for adoption.

Health Check

  • Last commit: 1 week ago
  • Responsiveness: Inactive
  • Pull requests (30d): 3
  • Issues (30d): 4
  • Star history: 78 stars in the last 30 days

Explore Similar Projects

Starred by Patrick von Platen (author of Hugging Face Diffusers; research engineer at Mistral) and Omar Sanseviero (DevRel at Google DeepMind).

AudioLDM by haoheliu

Audio generation research paper using latent diffusion

Top 0.2% on SourcePulse · 3k stars · Created 3 years ago · Updated 9 months ago