Research code for visually guided sound generation via codebook sampling
This repository provides the implementation for "Taming Visually Guided Sound Generation," a method for generating relevant, high-fidelity audio from visual cues. It's targeted at researchers and developers interested in cross-modal generation, particularly audio synthesis guided by video. The core benefit is enabling controllable and high-quality sound generation conditioned on visual input.
How It Works
The approach first trains a Spectrogram VQGAN to learn a codebook of representative spectrogram vectors. A GPT-2-style transformer is then trained to model sequences of these codebook indices autoregressively, conditioned on visual features extracted from the video; at sampling time, the generated indices are decoded back into a spectrogram by the VQGAN decoder. This two-stage process allows generating long, coherent audio sequences that align with the visual content.
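To make the two-stage idea concrete, below is a minimal, illustrative sketch in PyTorch. It is not the repository's API: the codebook is random, a single linear layer stands in for the GPT-2-style prior, random tensors replace real visual features, and the VQGAN decoder/vocoder steps are omitted.

# Minimal sketch of the two-stage idea; toy stand-ins, not the repository's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

CODEBOOK_SIZE, CODE_DIM, VISUAL_DIM, SEQ_LEN = 1024, 256, 512, 64  # toy sizes

# Stage 1 stand-in: the codebook a Spectrogram VQGAN would have learned.
codebook = nn.Embedding(CODEBOOK_SIZE, CODE_DIM)

# Stage 2 stand-in: an autoregressive prior over codebook indices,
# conditioned on visual features (a linear layer instead of a GPT-2-style model).
class ToyPrior(nn.Module):
    def __init__(self):
        super().__init__()
        self.visual_proj = nn.Linear(VISUAL_DIM, CODE_DIM)
        self.head = nn.Linear(2 * CODE_DIM, CODEBOOK_SIZE)

    def forward(self, prev_code, visual_feats):
        # prev_code: (B, CODE_DIM); visual_feats: (B, VISUAL_DIM)
        ctx = torch.cat([prev_code, self.visual_proj(visual_feats)], dim=-1)
        return self.head(ctx)  # logits over the next codebook index

prior = ToyPrior()
visual_feats = torch.randn(1, VISUAL_DIM)   # stands in for pooled video features
prev = torch.zeros(1, CODE_DIM)             # start-of-sequence embedding
indices = []
for _ in range(SEQ_LEN):                    # sample code indices left to right
    logits = prior(prev, visual_feats)
    idx = torch.multinomial(F.softmax(logits, dim=-1), num_samples=1)
    indices.append(idx)
    prev = codebook(idx).squeeze(1)

codes = codebook(torch.cat(indices, dim=1))  # (1, SEQ_LEN, CODE_DIM)
# A real pipeline would pass these codes through the VQGAN decoder to obtain a
# spectrogram and then vocode it to a waveform; both steps are omitted here.

In the actual method, the prior is a transformer that attends over the full history of sampled codes and the visual feature sequence, rather than only the previous code as in this sketch.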
Quick Start & Requirements
Create the conda environment from the provided file:

conda env create -f conda_env.yml
Highlighted Details
Maintenance & Community
The project accompanies a BMVC 2021 oral paper. The codebase builds on the taming-transformers repository. No explicit community channels (Discord/Slack) are mentioned.
Licensing & Compatibility
The repository does not explicitly state a license in the README. Compatibility for commercial use or closed-source linking is not specified.
Limitations & Caveats
The README notes that the perceptual loss can fail silently during training and suggests disabling it for faster iterations. The dataset sizes (VAS ~24 GB, VGGSound ~420 GB) and the feature extraction times point to significant resource requirements for training from scratch.