HunyuanVideo-Foley by Tencent-Hunyuan

AI-powered Foley sound generation for video

Created 1 month ago
1,070 stars

Top 35.4% on SourcePulse

View on GitHub
Project Summary

HunyuanVideo-Foley addresses the challenge of generating high-fidelity Foley audio synchronized with video content. It is aimed at video content creation, film production, advertising, and game development, offering professional-grade AI sound-effect generation that enhances realism and immersion.

How It Works

The model employs a hybrid architecture combining multimodal and unimodal transformer blocks: multimodal transformers process the visual and audio streams jointly, while unimodal transformers refine the audio stream on its own. Visual features are extracted by a pre-trained visual encoder, and text prompts are handled by a separate text encoder. Audio is encoded into latent representations, which are perturbed with Gaussian noise for the generative denoising objective. Temporal alignment uses a Synchformer-based approach with gated modulation, ensuring frame-level synchronization, and the design balances visual and textual cues for comprehensive sound effect generation. A sketch of this block layout follows.
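
Below is a minimal, hypothetical PyTorch sketch of that layout: multimodal blocks attend jointly over concatenated audio and visual tokens, then unimodal blocks refine the audio tokens under a sigmoid gate driven by frame-aligned sync features. Every class name, dimension, and the exact gating formula here are assumptions for illustration (text conditioning is omitted for brevity); this is not the repository's implementation.

    # Hypothetical sketch of the hybrid block layout; module names, sizes,
    # and the gating formula are illustrative assumptions, not the repo's code.
    import torch
    import torch.nn as nn

    class MultimodalBlock(nn.Module):
        """Joint self-attention over concatenated audio and visual tokens."""
        def __init__(self, dim, heads=8):
            super().__init__()
            self.norm1 = nn.LayerNorm(dim)
            self.norm2 = nn.LayerNorm(dim)
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.mlp = nn.Sequential(
                nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

        def forward(self, audio, visual):
            x = torch.cat([audio, visual], dim=1)   # (B, Ta + Tv, D)
            h = self.norm1(x)
            x = x + self.attn(h, h, h, need_weights=False)[0]
            x = x + self.mlp(self.norm2(x))
            n = audio.shape[1]
            return x[:, :n], x[:, n:]               # split the streams back out

    class UnimodalBlock(nn.Module):
        """Audio-only refinement, gated by frame-aligned sync features."""
        def __init__(self, dim, heads=8):
            super().__init__()
            self.norm = nn.LayerNorm(dim)
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

        def forward(self, audio, sync):
            h = self.norm(audio)
            audio = audio + self.attn(h, h, h, need_weights=False)[0]
            return audio * self.gate(sync)          # gated modulation

    class HybridDenoiser(nn.Module):
        """Multimodal blocks first, then unimodal audio refinement."""
        def __init__(self, dim=256, n_mm=2, n_um=2):
            super().__init__()
            self.mm = nn.ModuleList(MultimodalBlock(dim) for _ in range(n_mm))
            self.um = nn.ModuleList(UnimodalBlock(dim) for _ in range(n_um))

        def forward(self, noisy_audio, visual, sync):
            for blk in self.mm:
                noisy_audio, visual = blk(noisy_audio, visual)
            for blk in self.um:
                noisy_audio = blk(noisy_audio, sync)
            return noisy_audio                      # denoised audio latents

    # Shape check with random tensors standing in for real encoder outputs.
    model = HybridDenoiser()
    audio = torch.randn(1, 128, 256)    # noisy audio latents
    visual = torch.randn(1, 64, 256)    # visual encoder features
    sync = torch.randn(1, 128, 256)     # frame-aligned sync features
    print(model(audio, visual, sync).shape)  # torch.Size([1, 128, 256])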

Quick Start & Requirements

  • Installation: Clone the repository, then install dependencies using pip install -r requirements.txt.
  • Prerequisites: CUDA 12.4 or 11.8 recommended, Python 3.8+, Linux OS. Inference requires approximately 20 GB of VRAM; an NVIDIA GPU with at least 24 GB (e.g., RTX 3090/4090) is recommended.
  • Pretrained Models: Download from Huggingface (git clone https://huggingface.co/tencent/HunyuanVideo-Foley).
  • Usage: Single-video generation (python3 infer.py --single_video ...), batch processing (python3 infer.py --csv_path ...; see the CSV sketch after this list), or an interactive Gradio web interface (python3 gradio_app.py).
  • Documentation: Usage examples and model details are provided in the README.
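
For batch processing, here is a minimal sketch of preparing the CSV that infer.py consumes via --csv_path. The column names (video, prompt) and file paths are hypothetical, since the summary does not specify the schema; check the README for the expected columns.

    # Hypothetical batch CSV for infer.py --csv_path; the column names
    # ("video", "prompt") are assumptions -- consult the README for the schema.
    import csv

    rows = [
        ("clips/door_slam.mp4", "a heavy wooden door slamming shut"),
        ("clips/rain.mp4", "steady rain falling on a tin roof"),
    ]
    with open("batch.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["video", "prompt"])
        writer.writerows(rows)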

Highlighted Details

  • Achieves state-of-the-art performance across multiple evaluation benchmarks, including audio fidelity, visual-semantic alignment, and temporal alignment.
  • Supports multi-scenario audio-visual synchronization, handling complex video scenes.
  • Generates 48 kHz Hi-Fi audio output using a self-developed audio VAE (see the snippet after this list).
  • Intelligently balances visual and textual information for personalized dubbing requirements.
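
To give a sense of the 48 kHz output format, a small sketch that writes a stand-in waveform at that sample rate with torchaudio; in practice the samples would come from the model's VAE decoder rather than random noise.

    # Stand-in for saving model output: 5 s of mono audio at 48 kHz.
    import torch
    import torchaudio

    waveform = torch.randn(1, 48_000 * 5)   # placeholder, not real Foley audio
    torchaudio.save("foley.wav", waveform, sample_rate=48_000)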

Maintenance & Community

The project is from Tencent Hunyuan, with contributions from Zhejiang University and Nanjing University of Aeronautics and Astronautics. Links to GitHub, Twitter, and the HunyuanAI website are provided for contact and updates.

Licensing & Compatibility

The repository is © 2025 Tencent Hunyuan, all rights reserved. Licensing terms for commercial use or closed-source linking are not detailed in the provided README snippet.

Limitations & Caveats

The model primarily supports Linux and requires significant VRAM (roughly 20 GB), limiting its use on consumer-grade hardware without a high-end GPU.
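
Given the ~20 GB figure, a quick pre-flight check can avoid a failed run. A small sketch using PyTorch's CUDA introspection; the threshold simply mirrors the number above.

    # Pre-flight check: verify the visible GPU has enough memory before
    # launching inference. The 20 GB threshold mirrors the figure above.
    import torch

    REQUIRED_GB = 20
    if not torch.cuda.is_available():
        raise SystemExit("A CUDA-capable GPU is required.")
    total_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    print(f"{torch.cuda.get_device_name(0)}: {total_gb:.1f} GB VRAM")
    if total_gb < REQUIRED_GB:
        raise SystemExit(f"Need ~{REQUIRED_GB} GB of VRAM; found {total_gb:.1f} GB.")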

Health Check

  • Last Commit: 2 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 24
  • Star History: 1,084 stars in the last 30 days

Explore Similar Projects

Starred by Patrick von Platen (Author of Hugging Face Diffusers; Research Engineer at Mistral) and Omar Sanseviero (DevRel at Google DeepMind).

AudioLDM by haoheliu
Top 0.1% · 3k stars
Audio generation research paper using latent diffusion
Created 2 years ago · Updated 2 months ago