MMAudio by hkchengrex

Synthesize high-quality audio from video and text

Created 9 months ago
1,858 stars

Top 23.3% on SourcePulse

View on GitHub
1 Expert Loves This Project
Project Summary

MMAudio addresses the challenge of generating high-quality, synchronized audio for video content using multimodal joint training. It is designed for researchers and developers working with audio-visual synthesis, offering a novel approach to combine diverse audio-visual and audio-text datasets for improved model performance. The primary benefit is the ability to synthesize synchronized audio that accurately matches video inputs, enhancing the realism and impact of multimedia content.

How It Works

MMAudio employs a multimodal joint training strategy, enabling it to learn from a wide array of audio-visual and audio-text datasets. A key component is its synchronization module, which specifically aligns the generated audio with the corresponding video frames. This approach allows for a more robust and accurate synthesis of audio that is temporally coherent with the visual information, a significant advancement over methods trained on single modalities.
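The sketch below is a minimal, hypothetical illustration of this idea, not MMAudio's actual code: video, synchronization, and text features are projected into a shared space and attended to alongside audio tokens, so a single backbone can be trained on batches from either audio-visual or audio-text datasets. All module names, dimensions, and the missing-modality handling are assumptions.

```python
import torch
import torch.nn as nn

class JointConditionedAudioGenerator(nn.Module):
    """Hypothetical sketch of joint audio-visual/audio-text conditioning."""

    def __init__(self, dim=512, num_layers=4):
        super().__init__()
        # Project each conditioning modality into a shared token space.
        self.video_proj = nn.Linear(1024, dim)  # e.g. CLIP visual features
        self.sync_proj = nn.Linear(768, dim)    # e.g. Synchformer features
        self.text_proj = nn.Linear(1024, dim)   # e.g. text-encoder features
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.audio_head = nn.Linear(dim, 80)    # e.g. mel / latent channels

    def forward(self, audio_tokens, video_feats=None, sync_feats=None, text_feats=None):
        # Batches from audio-text datasets simply omit the video/sync streams,
        # and vice versa, so all data sources share the same backbone weights.
        seq = [audio_tokens]
        if video_feats is not None:
            seq.append(self.video_proj(video_feats))
        if sync_feats is not None:
            seq.append(self.sync_proj(sync_feats))
        if text_feats is not None:
            seq.append(self.text_proj(text_feats))
        x = self.backbone(torch.cat(seq, dim=1))
        # Only the audio positions are decoded into the output representation.
        return self.audio_head(x[:, :audio_tokens.shape[1]])
```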

Quick Start & Requirements

  • Installation: Clone the repository and install via pip install -e . after setting up PyTorch with CUDA support.
  • Prerequisites: Python 3.9+, PyTorch 2.5.1+ with matching torchvision/torchaudio (CUDA 11.8 or compatible recommended). Tested on Ubuntu.
  • Resources: Inference requires approximately 6 GB of GPU memory (16-bit mode); a quick environment check is sketched after this list.
  • Demos: Hugging Face, Colab, and Replicate demos are available.
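As a sanity check before running inference, the snippet below (not part of the repository) verifies that PyTorch sees a CUDA device with roughly the 6 GB of memory quoted above.

```python
import torch

# Verify the environment meets the stated requirements:
# PyTorch with CUDA support and ~6 GB of GPU memory for 16-bit inference.
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available:  {torch.cuda.is_available()}")

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    total_gb = props.total_memory / 1024**3
    print(f"GPU: {props.name}, {total_gb:.1f} GB total memory")
    if total_gb < 6:
        print("Warning: less than ~6 GB of GPU memory; 16-bit inference may fail.")
else:
    print("No CUDA device found; MMAudio inference expects a CUDA-capable GPU.")
```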

Highlighted Details

  • Supports both video-to-audio and text-to-audio synthesis.
  • Experimental image-to-audio synthesis is also available.
  • CLIP encoder resizes frames to 384x384; Synchformer uses a 224x224 center crop.
  • CLIP operates at 8 FPS and Synchformer at 25 FPS, with on-the-fly frame rate conversion (see the preprocessing sketch after this list).
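The following is an illustrative sketch, not MMAudio's actual preprocessing code, of how one clip could be split into the two streams described above. The nearest-frame resampling and the short-side resize before the 224x224 center crop are assumptions; only the target sizes and frame rates come from the summary.

```python
import torch
from torchvision.transforms import v2

def split_video_streams(frames: torch.Tensor, src_fps: float):
    """frames: (T, C, H, W) tensor at the source frame rate."""
    if frames.dtype == torch.uint8:
        frames = frames.float() / 255.0

    def resample(x: torch.Tensor, target_fps: float) -> torch.Tensor:
        # Nearest-frame resampling as a simple stand-in for
        # on-the-fly frame rate conversion.
        t = x.shape[0]
        n_out = max(1, round(t / src_fps * target_fps))
        idx = torch.linspace(0, t - 1, n_out).round().long()
        return x[idx]

    # CLIP stream: 8 FPS, frames resized to 384x384.
    clip_tf = v2.Resize((384, 384), antialias=True)
    # Synchformer stream: 25 FPS, 224x224 center crop
    # (the preceding short-side resize is an assumption).
    sync_tf = v2.Compose([v2.Resize(256, antialias=True), v2.CenterCrop(224)])

    clip_frames = clip_tf(resample(frames, target_fps=8))
    sync_frames = sync_tf(resample(frames, target_fps=25))
    return clip_frames, sync_frames
```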

Maintenance & Community

The project is associated with CVPR 2025. Further community interaction details (e.g., Discord/Slack) are not explicitly mentioned in the README.

Licensing & Compatibility

The project is trained on datasets (AudioSet, Freesound, VGGSound, AudioCaps, WavCaps) with their own licenses. The README states, "We do not guarantee that the pre-trained models are suitable for commercial use. Please use them at your own risk."

Limitations & Caveats

The model may generate unintelligible speech-like sounds or background music even though it was not explicitly trained to produce them, and it struggles with unfamiliar concepts (e.g., specific weapon sounds). Performance can vary across different hardware and software environments. The README notes that higher-resolution videos do not improve audio quality and can increase processing time due to encoding/decoding.

Health Check

  • Last Commit: 1 month ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 2
  • Star History: 68 stars in the last 30 days

Explore Similar Projects

Starred by Patrick von Platen (Author of Hugging Face Diffusers; Research Engineer at Mistral) and Omar Sanseviero (DevRel at Google DeepMind).

AudioLDM by haoheliu
Top 0.1% on SourcePulse · 3k stars
Audio generation research paper using latent diffusion
Created 2 years ago · Updated 2 months ago