HunyuanVideo-Foley by Tencent-Hunyuan

AI-powered Foley sound generation for video

Created 1 month ago
1,070 stars

Top 35.4% on SourcePulse

View on GitHub
Project Summary

HunyuanVideo-Foley addresses the challenge of generating high-fidelity Foley audio synchronized with video content. It is aimed at video content creation, film production, advertising, and game development, offering professional-grade AI sound-effect generation that enhances realism and immersion.

How It Works

The model employs a hybrid architecture combining multimodal and unimodal transformer blocks: multimodal transformers process the visual and audio streams jointly, while unimodal transformers refine the audio stream on its own. Visual features are extracted by a pre-trained visual encoder, and text prompts are handled by a separate text encoder. Audio is encoded into latent representations, which are perturbed with Gaussian noise for the generative denoising objective. Temporal alignment uses a Synchformer-based approach with gated modulation, ensuring frame-level synchronization, and the design balances visual and textual cues for comprehensive sound effect generation. A sketch of this block layout follows.
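
Below is a minimal, hypothetical PyTorch sketch of that layout: multimodal blocks attend jointly over concatenated audio and visual tokens, then unimodal blocks refine the audio tokens under a sigmoid gate driven by frame-aligned sync features. Every class name, dimension, and the exact gating formula here are assumptions for illustration (text conditioning is omitted for brevity); this is not the repository's implementation.

    # Hypothetical sketch of the hybrid block layout; module names, sizes,
    # and the gating formula are illustrative assumptions, not the repo's code.
    import torch
    import torch.nn as nn

    class MultimodalBlock(nn.Module):
        """Joint self-attention over concatenated audio and visual tokens."""
        def __init__(self, dim, heads=8):
            super().__init__()
            self.norm1 = nn.LayerNorm(dim)
            self.norm2 = nn.LayerNorm(dim)
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.mlp = nn.Sequential(
                nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

        def forward(self, audio, visual):
            x = torch.cat([audio, visual], dim=1)   # (B, Ta + Tv, D)
            h = self.norm1(x)
            x = x + self.attn(h, h, h, need_weights=False)[0]
            x = x + self.mlp(self.norm2(x))
            n = audio.shape[1]
            return x[:, :n], x[:, n:]               # split the streams back out

    class UnimodalBlock(nn.Module):
        """Audio-only refinement, gated by frame-aligned sync features."""
        def __init__(self, dim, heads=8):
            super().__init__()
            self.norm = nn.LayerNorm(dim)
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

        def forward(self, audio, sync):
            h = self.norm(audio)
            audio = audio + self.attn(h, h, h, need_weights=False)[0]
            return audio * self.gate(sync)          # gated modulation

    class HybridDenoiser(nn.Module):
        """Multimodal blocks first, then unimodal audio refinement."""
        def __init__(self, dim=256, n_mm=2, n_um=2):
            super().__init__()
            self.mm = nn.ModuleList(MultimodalBlock(dim) for _ in range(n_mm))
            self.um = nn.ModuleList(UnimodalBlock(dim) for _ in range(n_um))

        def forward(self, noisy_audio, visual, sync):
            for blk in self.mm:
                noisy_audio, visual = blk(noisy_audio, visual)
            for blk in self.um:
                noisy_audio = blk(noisy_audio, sync)
            return noisy_audio                      # denoised audio latents

    # Shape check with random tensors standing in for real encoder outputs.
    model = HybridDenoiser()
    audio = torch.randn(1, 128, 256)    # noisy audio latents
    visual = torch.randn(1, 64, 256)    # visual encoder features
    sync = torch.randn(1, 128, 256)     # frame-aligned sync features
    print(model(audio, visual, sync).shape)  # torch.Size([1, 128, 256])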

Quick Start & Requirements

  • Installation: Clone the repository, then install dependencies using pip install -r requirements.txt.
  • Prerequisites: CUDA 12.4 or 11.8 recommended, Python 3.8+, Linux OS. Inference requires approximately 20 GB of VRAM; an NVIDIA GPU with at least 24 GB (e.g., RTX 3090/4090) is recommended.
  • Pretrained Models: Download from Huggingface (git clone https://huggingface.co/tencent/HunyuanVideo-Foley).
  • Usage: Single-video generation (python3 infer.py --single_video ...), batch processing (python3 infer.py --csv_path ...; see the CSV sketch after this list), or an interactive Gradio web interface (python3 gradio_app.py).
  • Documentation: Usage examples and model details are provided in the README.
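
For batch processing, here is a minimal sketch of preparing the CSV that infer.py consumes via --csv_path. The column names (video, prompt) and file paths are hypothetical, since the summary does not specify the schema; check the README for the expected columns.

    # Hypothetical batch CSV for infer.py --csv_path; the column names
    # ("video", "prompt") are assumptions -- consult the README for the schema.
    import csv

    rows = [
        ("clips/door_slam.mp4", "a heavy wooden door slamming shut"),
        ("clips/rain.mp4", "steady rain falling on a tin roof"),
    ]
    with open("batch.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["video", "prompt"])
        writer.writerows(rows)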

Highlighted Details

  • Achieves state-of-the-art performance across multiple evaluation benchmarks, including audio fidelity, visual-semantic alignment, and temporal alignment.
  • Supports multi-scenario audio-visual synchronization, handling complex video scenes.
  • Generates 48 kHz Hi-Fi audio output using a self-developed audio VAE (see the snippet after this list).
  • Intelligently balances visual and textual information for personalized dubbing requirements.
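
To give a sense of the 48 kHz output format, a small sketch that writes a stand-in waveform at that sample rate with torchaudio; in practice the samples would come from the model's VAE decoder rather than random noise.

    # Stand-in for saving model output: 5 s of mono audio at 48 kHz.
    import torch
    import torchaudio

    waveform = torch.randn(1, 48_000 * 5)   # placeholder, not real Foley audio
    torchaudio.save("foley.wav", waveform, sample_rate=48_000)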

Maintenance & Community

The project is from Tencent Hunyuan, with contributions from Zhejiang University and Nanjing University of Aeronautics and Astronautics. Links to GitHub, Twitter, and the HunyuanAI website are provided for contact and updates.

Licensing & Compatibility

The repository is © 2025 Tencent Hunyuan, all rights reserved. Licensing terms for commercial use or closed-source linking are not detailed in the provided README snippet.

Limitations & Caveats

The model primarily supports Linux and requires significant VRAM (roughly 20 GB), limiting its use on consumer-grade hardware without a high-end GPU.
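
Given the ~20 GB figure, a quick pre-flight check can avoid a failed run. A small sketch using PyTorch's CUDA introspection; the threshold simply mirrors the number above.

    # Pre-flight check: verify the visible GPU has enough memory before
    # launching inference. The 20 GB threshold mirrors the figure above.
    import torch

    REQUIRED_GB = 20
    if not torch.cuda.is_available():
        raise SystemExit("A CUDA-capable GPU is required.")
    total_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    print(f"{torch.cuda.get_device_name(0)}: {total_gb:.1f} GB VRAM")
    if total_gb < REQUIRED_GB:
        raise SystemExit(f"Need ~{REQUIRED_GB} GB of VRAM; found {total_gb:.1f} GB.")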

Health Check

  • Last Commit: 2 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 24
  • Star History: 1,084 stars in the last 30 days

Explore Similar Projects

Starred by Patrick von Platen (Author of Hugging Face Diffusers; Research Engineer at Mistral) and Omar Sanseviero (DevRel at Google DeepMind).

AudioLDM by haoheliu
Top 0.1% · 3k stars
Audio generation research paper using latent diffusion
Created 2 years ago · Updated 2 months ago