MOVA by OpenMOSS

Scalable foundation model for synchronized video-audio generation

Created 4 weeks ago

713 stars

Top 48.1% on SourcePulse

View on GitHub
Project Summary

MOVA is a foundation model addressing the challenge of synchronized video and audio generation, aiming to break the "silent era" of open-source video synthesis. It targets researchers and developers seeking high-fidelity, perfectly aligned video and audio outputs, offering a significant benefit over cascaded pipelines by generating both modalities simultaneously in a single inference pass.

How It Works

MOVA employs an Asymmetric Dual-Tower Architecture, leveraging pre-trained video and audio models. These towers are fused via a bidirectional cross-attention mechanism, enabling rich modality interaction. This native bimodal generation approach avoids error accumulation inherent in cascaded systems and achieves precise lip-sync and environment-aware sound effects.
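The fusion step can be sketched in plain NumPy. This is a minimal, single-head illustration of bidirectional cross-attention between two token streams with a residual connection, not MOVA's actual implementation; all names, shapes, and dimensions here are invented for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(q_tokens, kv_tokens):
    # Single-head cross-attention: queries come from one tower,
    # keys/values from the other (here keys and values are the raw tokens).
    d = q_tokens.shape[-1]
    scores = q_tokens @ kv_tokens.T / np.sqrt(d)   # (n_q, n_kv)
    return softmax(scores) @ kv_tokens             # (n_q, d)

# Toy token streams: 16 video tokens and 24 audio tokens, dim 32.
rng = np.random.default_rng(0)
video = rng.standard_normal((16, 32))
audio = rng.standard_normal((24, 32))

# Bidirectional fusion: each tower attends to the other, then adds the
# result back to its own stream via a residual connection.
video_fused = video + cross_attend(video, audio)
audio_fused = audio + cross_attend(audio, video)
```

Each tower keeps its own token count and dimensionality; only the attention exchange couples the two streams, which is what lets pre-trained unimodal backbones be fused without reshaping either one.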

Quick Start & Requirements

  • Installation: Set up a Python 3.13 environment (conda create -n mova python=3.13 -y, conda activate mova) and install the package (pip install -e .). Training requires pip install -e ".[train]".
  • Prerequisites: Python 3.13, Hugging Face Hub for model downloads. Inference and training require significant VRAM and Host RAM, detailed in performance tables below. Ascend NPU support is available.
  • Resource Footprint:
    • Inference (360p, 8s video): component-wise offload needs ~48GB VRAM and ~67GB Host RAM (RTX 4090) or ~9GB VRAM and ~67GB Host RAM (H100); layerwise offload needs ~12GB VRAM and ~77GB Host RAM (RTX 4090/H100).
    • Training (360p, 8s video): low-resource LoRA (single GPU) needs ~18GB VRAM and ~80GB Host RAM (RTX 4090); Accelerate + FSDP LoRA (8 GPUs) needs ~50GB VRAM per GPU and ≥128GB Host RAM (H100).
  • Links: Demos are available via links in the README. Model weights are downloadable from Hugging Face (e.g., hf download OpenMOSS-Team/MOVA-360p).
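The weight download can also be done programmatically rather than with the `hf` CLI. A minimal sketch assuming the `huggingface_hub` Python package is installed; the helper name `fetch_mova_weights` and the `local_dir` default are hypothetical, and only the repo id `OpenMOSS-Team/MOVA-360p` comes from the README.

```python
# Hypothetical helper (not part of the MOVA codebase): mirror the
# released MOVA-360p checkpoints locally via the huggingface_hub API
# instead of the `hf download` CLI shown above.
MOVA_REPO = "OpenMOSS-Team/MOVA-360p"  # repo id from the README

def fetch_mova_weights(local_dir="./checkpoints/MOVA-360p"):
    # snapshot_download copies the whole model repo into local_dir
    # and returns the path to the downloaded snapshot.
    from huggingface_hub import snapshot_download
    return snapshot_download(repo_id=MOVA_REPO, local_dir=local_dir)
```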

Highlighted Details

  • Native bimodal generation of video and synchronized audio in a single pass.
  • State-of-the-art performance in multilingual lip-synchronization and environment-aware sound effects, outperforming existing open-source models on Verse-Bench metrics.
  • Fully open-source release including model weights, inference code, training pipelines, and LoRA fine-tuning scripts.
  • Asymmetric Dual-Tower Architecture with bidirectional cross-attention for modality fusion.
  • Support for Ascend NPUs and integration with SGLang for inference.

Maintenance & Community

The project was released on January 29, 2026. Key features like checkpoints, multi-GPU inference, LoRA fine-tuning, NPU support, and SGLang integration are complete. Pending items include a Technical Report, Generation Workflow, and Diffusers Integration. Acknowledgements list contributions from several other open-source projects. No direct community links (Discord/Slack) or social handles are provided.

Licensing & Compatibility

The license type is not explicitly stated in the provided README. This lack of explicit licensing information poses a significant adoption blocker, particularly for commercial use or integration into closed-source projects.

Limitations & Caveats

Training 8-second, 360p videos on consumer hardware like an RTX 4090 is not recommended due to high resource requirements and slow training speeds; reducing resolution or frame count is suggested. A technical report and Diffusers integration are still pending. The absence of a clearly defined license is a critical caveat for adoption.

Health Check

  • Last Commit: 6 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 19
  • Issues (30d): 20
  • Star History: 732 stars in the last 28 days

Explore Similar Projects

Starred by Patrick von Platen (author of Hugging Face Diffusers; research engineer at Mistral) and Omar Sanseviero (DevRel at Google DeepMind).

AudioLDM by haoheliu

  • Audio generation research paper using latent diffusion
  • Top 0.1% on SourcePulse · 3k stars
  • Created 3 years ago · Updated 8 months ago