SpecVQGAN by v-iashin

Research code for visually guided sound generation via codebook sampling

created 3 years ago
363 stars

Top 78.5% on sourcepulse

View on GitHub
Project Summary

This repository provides the implementation for "Taming Visually Guided Sound Generation," a method for generating relevant, high-fidelity audio from visual cues. It's targeted at researchers and developers interested in cross-modal generation, particularly audio synthesis guided by video. The core benefit is enabling controllable and high-quality sound generation conditioned on visual input.

How It Works

The approach uses a Spectrogram VQGAN to learn a codebook of representative spectrogram vectors. A transformer (a GPT-2 variant) is then trained to autoregressively sample these codebook entries as tokens, conditioned on visual features extracted from the video. This two-stage process allows for generating long, coherent audio sequences that align with the visual content.
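The second stage can be illustrated with a minimal PyTorch sketch. Everything below (module names, layer sizes, the 2048-dimensional visual features, the top-k sampling loop, and the 5 x 53 code grid) is a toy stand-in chosen for illustration, not the repository's actual classes or configuration:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    # Toy sizes -- illustrative only, not the repository's actual configuration.
    CODEBOOK_SIZE = 1024      # number of learned spectrogram code vectors
    CODE_DIM = 256            # dimensionality of each code vector
    N_VISUAL_TOKENS = 16      # visual feature tokens used as the conditioning prefix
    N_AUDIO_TOKENS = 5 * 53   # codebook indices covering one spectrogram (freq x time grid)
    D_MODEL = 256

    class ToyConditionalGPT(nn.Module):
        """Decoder-only transformer that predicts the next codebook index,
        conditioned on a prefix of projected visual features."""
        def __init__(self):
            super().__init__()
            self.visual_proj = nn.Linear(2048, D_MODEL)          # stand-in for pooled CNN features
            self.token_emb = nn.Embedding(CODEBOOK_SIZE, D_MODEL)
            self.pos_emb = nn.Parameter(
                torch.zeros(1, N_VISUAL_TOKENS + N_AUDIO_TOKENS, D_MODEL))
            layer = nn.TransformerEncoderLayer(
                d_model=D_MODEL, nhead=8, dim_feedforward=4 * D_MODEL, batch_first=True)
            self.blocks = nn.TransformerEncoder(layer, num_layers=4)
            self.head = nn.Linear(D_MODEL, CODEBOOK_SIZE)

        def forward(self, visual_feats, audio_tokens):
            # visual_feats: (B, N_VISUAL_TOKENS, 2048); audio_tokens: (B, T) int64
            x = torch.cat([self.visual_proj(visual_feats), self.token_emb(audio_tokens)], dim=1)
            x = x + self.pos_emb[:, : x.size(1)]
            # Causal mask: -inf above the diagonal so each position only attends to the past.
            causal = torch.full((x.size(1), x.size(1)), float("-inf")).triu(diagonal=1)
            return self.head(self.blocks(x, mask=causal))       # (B, prefix + T, CODEBOOK_SIZE)

    @torch.no_grad()
    def sample_codes(model, visual_feats, n_tokens=N_AUDIO_TOKENS, top_k=64):
        """Autoregressively sample codebook indices with top-k sampling."""
        tokens = torch.zeros(visual_feats.size(0), 1, dtype=torch.long)   # dummy start token
        for _ in range(n_tokens):
            logits = model(visual_feats, tokens)[:, -1]          # logits for the next position
            top_vals, top_idx = logits.topk(top_k, dim=-1)
            probs = F.softmax(top_vals, dim=-1)
            next_tok = top_idx.gather(-1, torch.multinomial(probs, 1))
            tokens = torch.cat([tokens, next_tok], dim=1)
        return tokens[:, 1:]                                     # drop the start token

    codebook = nn.Embedding(CODEBOOK_SIZE, CODE_DIM)             # stand-in for the frozen stage-1 codebook
    model = ToyConditionalGPT().eval()
    visual_feats = torch.randn(1, N_VISUAL_TOKENS, 2048)         # stand-in for extracted video features
    codes = sample_codes(model, visual_feats)                    # (1, 265) sampled indices
    z_q = codebook(codes).view(1, 5, 53, CODE_DIM)               # code grid for a decoder to turn into a spectrogram
    print(codes.shape, z_q.shape)

In the actual pipeline, the sampled index grid is looked up in the learned codebook and decoded back into a spectrogram by the Spectrogram VQGAN.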

Quick Start & Requirements

  • Install: Clone the repository and create a conda environment using conda env create -f conda_env.yml.
  • Prerequisites: PyTorch 1.8, CUDA 11, Linux. The VAS and VGGSound datasets are required; feature-extraction scripts are provided (a minimal environment check is sketched after this list).
  • Resources: Feature extraction can be resource-intensive (e.g., 6 days on three 2080Ti GPUs for VGGSound). Pre-trained models are available.
  • Links: Project Page, arXiv, YouTube Presentation
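
Before running feature extraction or training, a quick sanity check of the PyTorch/CUDA setup can save time. This snippet is illustrative and not part of the repository:

    import torch

    # Illustrative environment check -- the project targets PyTorch 1.8 / CUDA 11 on Linux.
    print("PyTorch:", torch.__version__)
    print("CUDA available:", torch.cuda.is_available())
    if torch.cuda.is_available():
        print("CUDA (build):", torch.version.cuda)
        for i in range(torch.cuda.device_count()):
            print(f"GPU {i}:", torch.cuda.get_device_name(i))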

Highlighted Details

  • Offers pre-trained codebooks and transformers for various configurations (e.g., VGGSound, VAS datasets, different feature extractors like BN-Inception and ResNet50).
  • Includes evaluation metrics (FID, Avg. MKL) and sampling tools for assessing generated audio quality.
  • Provides scripts for training both the spectrogram codebook and the transformer model.
  • Features a Streamlit-based sampling tool for interactive generation and a demo of its use as a neural audio codec (sketched below).
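
The neural-audio-codec use mentioned above comes down to representing a spectrogram as a short sequence of codebook indices. The sketch below illustrates the idea with random stand-ins for the pretrained encoder and codebook; the 80 x 848 mel input and the resulting 5 x 53 code grid are illustrative sizes, not necessarily the repository's exact configuration:

    import torch
    import torch.nn as nn

    # Toy stand-ins: the real codec uses the pretrained Spectrogram VQGAN encoder/decoder.
    codebook = torch.randn(1024, 256)                         # 1024 code vectors of dimension 256
    encoder = nn.Conv2d(1, 256, kernel_size=16, stride=16)    # placeholder downsampling encoder

    mel = torch.randn(1, 1, 80, 848)                          # mel spectrogram: 80 bins x 848 frames
    z = encoder(mel)                                          # (1, 256, 5, 53) latent grid
    z_flat = z.permute(0, 2, 3, 1).reshape(-1, 256)           # one latent vector per grid cell

    # "Encode": replace each latent with the index of its nearest codebook vector.
    dists = torch.cdist(z_flat, codebook)                     # (265, 1024) pairwise distances
    indices = dists.argmin(dim=1)                             # 265 integer tokens describe the clip

    # "Decode": look the vectors back up; a decoder would map this grid to a spectrogram/waveform.
    z_q = codebook[indices].view(1, 5, 53, 256).permute(0, 3, 1, 2)

    print("tokens per clip:", indices.numel())                # 5 * 53 = 265 tokens

The point is the compression: a few hundred integer tokens stand in for a full spectrogram, which is what makes both the codec demo and the transformer sampling tractable.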

Maintenance & Community

The project is associated with the BMVC 2021 oral presentation. The codebase is built upon the "taming-transformers" repository. No explicit community channels (Discord/Slack) are mentioned.

Licensing & Compatibility

The repository does not explicitly state a license in the README. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The README mentions that the perceptual loss can silently fail during training and suggests disabling it for faster iteration. The dataset sizes (VAS ~24 GB, VGGSound ~420 GB) and feature-extraction times imply significant resource requirements for training from scratch.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 3 stars in the last 90 days
