Research code for visually guided sound generation via codebook sampling
This repository provides the implementation for "Taming Visually Guided Sound Generation," a method for generating relevant, high-fidelity audio from visual cues. It's targeted at researchers and developers interested in cross-modal generation, particularly audio synthesis guided by video. The core benefit is enabling controllable and high-quality sound generation conditioned on visual input.
How It Works
The approach first trains a Spectrogram VQGAN to learn a codebook of representative spectrogram vectors. A GPT-2-style transformer is then trained to model sequences of these codebook indices autoregressively, conditioned on visual features extracted from the video; at sampling time, the generated indices are decoded back into a spectrogram by the VQGAN decoder. This two-stage process allows generating long, coherent audio sequences that align with the visual content.
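To make the two-stage idea concrete, below is a minimal, illustrative sketch in PyTorch. It is not the repository's API: the codebook is random, a single linear layer stands in for the GPT-2-style prior, random tensors replace real visual features, and the VQGAN decoder/vocoder steps are omitted.

# Minimal sketch of the two-stage idea; toy stand-ins, not the repository's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

CODEBOOK_SIZE, CODE_DIM, VISUAL_DIM, SEQ_LEN = 1024, 256, 512, 64  # toy sizes

# Stage 1 stand-in: the codebook a Spectrogram VQGAN would have learned.
codebook = nn.Embedding(CODEBOOK_SIZE, CODE_DIM)

# Stage 2 stand-in: an autoregressive prior over codebook indices,
# conditioned on visual features (a linear layer instead of a GPT-2-style model).
class ToyPrior(nn.Module):
    def __init__(self):
        super().__init__()
        self.visual_proj = nn.Linear(VISUAL_DIM, CODE_DIM)
        self.head = nn.Linear(2 * CODE_DIM, CODEBOOK_SIZE)

    def forward(self, prev_code, visual_feats):
        # prev_code: (B, CODE_DIM); visual_feats: (B, VISUAL_DIM)
        ctx = torch.cat([prev_code, self.visual_proj(visual_feats)], dim=-1)
        return self.head(ctx)  # logits over the next codebook index

prior = ToyPrior()
visual_feats = torch.randn(1, VISUAL_DIM)   # stands in for pooled video features
prev = torch.zeros(1, CODE_DIM)             # start-of-sequence embedding
indices = []
for _ in range(SEQ_LEN):                    # sample code indices left to right
    logits = prior(prev, visual_feats)
    idx = torch.multinomial(F.softmax(logits, dim=-1), num_samples=1)
    indices.append(idx)
    prev = codebook(idx).squeeze(1)

codes = codebook(torch.cat(indices, dim=1))  # (1, SEQ_LEN, CODE_DIM)
# A real pipeline would pass these codes through the VQGAN decoder to obtain a
# spectrogram and then vocode it to a waveform; both steps are omitted here.

In the actual method, the prior is a transformer that attends over the full history of sampled codes and the visual feature sequence, rather than only the previous code as in this sketch.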
Quick Start & Requirements
Create the conda environment from the provided file:

conda env create -f conda_env.yml
Highlighted Details
Maintenance & Community
The project accompanies a BMVC 2021 oral paper. The codebase builds on the taming-transformers repository. No explicit community channels (Discord/Slack) are mentioned.
Licensing & Compatibility
The repository does not explicitly state a license in the README. Compatibility for commercial use or closed-source linking is not specified.
Limitations & Caveats
The README notes that the perceptual loss can fail silently during training and suggests disabling it for faster iterations. The dataset sizes (VAS ~24 GB, VGGSound ~420 GB) and the feature extraction times point to significant resource requirements for training from scratch.