BigVGAN by NVIDIA

PyTorch implementation of a universal neural vocoder trained at large scale

created 3 years ago
1,078 stars

Top 35.8% on sourcepulse

Project Summary

This repository is the official PyTorch implementation of BigVGAN, a universal neural vocoder for high-fidelity audio synthesis. It targets researchers and developers working on speech synthesis (TTS) and audio generation, and it reports better audio quality and faster inference than earlier vocoders.

How It Works

BigVGAN is trained at large scale with a multi-scale sub-band CQT discriminator and a multi-scale mel spectrogram loss. A key engineering contribution is a custom fused CUDA kernel for the anti-aliased activation, which combines upsampling, activation, and downsampling in a single kernel. On an A100 GPU, this speeds up inference by 1.5-3x compared with the same steps run as separate standard PyTorch operations.
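
The three steps of the anti-aliased activation can be illustrated with a small, dependency-free sketch. This is not the repository's fused kernel or its PyTorch module: the function names, the Hann-windowed sinc filter, the 2x ratio, and the tap count are all assumptions chosen for readability. The nonlinearity is the Snake periodic activation described in the BigVGAN paper, with a fixed `alpha` instead of a learned one.

```python
import math


def snake(x, alpha=1.0):
    """Snake periodic activation: x + sin^2(alpha * x) / alpha."""
    return x + math.sin(alpha * x) ** 2 / alpha


def lowpass_taps(num_taps=31, cutoff=0.25):
    """Hann-windowed sinc low-pass FIR (hypothetical filter design).

    cutoff is in cycles per sample, so 0.5 is the Nyquist frequency.
    """
    mid = (num_taps - 1) / 2
    taps = []
    for n in range(num_taps):
        t = n - mid
        if t == 0:
            sinc = 2 * cutoff
        else:
            sinc = math.sin(2 * math.pi * cutoff * t) / (math.pi * t)
        window = 0.5 - 0.5 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(sinc * window)
    total = sum(taps)
    return [t / total for t in taps]  # normalize to unity gain at DC


def fir_same(signal, taps):
    """Zero-padded FIR filtering that keeps the input length."""
    half = len(taps) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for k, h in enumerate(taps):
            j = i + half - k
            if 0 <= j < len(signal):
                acc += h * signal[j]
        out.append(acc)
    return out


def anti_aliased_activation(x, alpha=1.0, ratio=2):
    """Upsample, apply the nonlinearity, then low-pass and downsample."""
    taps = lowpass_taps(cutoff=0.5 / ratio)
    # 1) Upsample: insert zeros, then low-pass. Scaling by `ratio`
    #    compensates for the energy lost to the inserted zeros.
    stuffed = []
    for v in x:
        stuffed.append(v * ratio)
        stuffed.extend([0.0] * (ratio - 1))
    up = fir_same(stuffed, taps)
    # 2) Apply the nonlinearity at the higher sampling rate, where the
    #    harmonics it creates have more headroom below Nyquist.
    act = [snake(v, alpha) for v in up]
    # 3) Low-pass again to remove content above the original Nyquist,
    #    then keep every `ratio`-th sample.
    return fir_same(act, taps)[::ratio]


# Usage: a short sine wave in, a signal of the same length out.
x = [math.sin(2 * math.pi * 0.2 * n) for n in range(64)]
y = anti_aliased_activation(x)
```

In this unfused form each step produces an intermediate signal at the higher rate. A fused kernel avoids writing those intermediates out between steps, which is a plausible source of the reported speedup.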

Quick Start & Requirements

  • Install: Clone the repository and install dependencies via pip install -r requirements.txt. A Conda environment setup is provided.
  • Prerequisites: Python 3.10, PyTorch 2.3.1 with CUDA 12.1 or 11.8.
  • Inference: Hugging Face Hub integration simplifies loading pretrained checkpoints and inference. A local Gradio demo is also available.
  • Links: Paper, Code, Showcase, Project Page, Weights, Demo

Highlighted Details

  • Offers pretrained checkpoints for various sampling rates (up to 44 kHz) and upsampling ratios (up to 512x).
  • Optional fused CUDA kernel speeds up inference by 1.5-3x on an A100 GPU.
  • Trained on diverse datasets including speech, environmental sounds, and instruments.
  • Achieves state-of-the-art objective metrics (PESQ, M-STFT, MCD) on audio quality benchmarks.
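
The relationship between sampling rate and upsampling ratio can be made concrete with a little arithmetic. In the sketch below the helper names are hypothetical, and "44 kHz" is assumed to mean 44,100 Hz. An upsampling ratio of 512x means each input mel frame is expanded into 512 output audio samples.

```python
def frames_per_second(sampling_rate_hz, upsampling_ratio):
    """Mel frames the vocoder consumes per second of generated audio."""
    return sampling_rate_hz / upsampling_ratio


def output_samples(num_mel_frames, upsampling_ratio):
    """Audio samples produced for a given number of mel frames."""
    return num_mel_frames * upsampling_ratio


def output_seconds(num_mel_frames, sampling_rate_hz, upsampling_ratio):
    """Duration of the generated audio in seconds."""
    return output_samples(num_mel_frames, upsampling_ratio) / sampling_rate_hz


# Assumed values: 44,100 Hz output with a 512x upsampling ratio.
rate = frames_per_second(44_100, 512)      # 86.1328125 frames per second
samples = output_samples(100, 512)         # 51200 samples from 100 frames
seconds = output_seconds(100, 44_100, 512)
```

A higher upsampling ratio means fewer mel frames are needed per second of audio, so the acoustic model feeding the vocoder has to match the ratio of the chosen checkpoint.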

Maintenance & Community

  • Most recent listed updates are from July and September 2024.
  • Integrates with Hugging Face Hub for easy access and demos.

Licensing & Compatibility

  • The repository does not explicitly state a license in the README. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

  • Training from scratch with small batch sizes may require adjusting clip_grad_norm to avoid early divergence.
  • Building the CUDA kernel requires an nvcc version compatible with the installed PyTorch build; a failed build most likely points to a mismatch between the two.
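
The gradient-clipping caveat can be sketched as a two-stage threshold. Everything here is illustrative: `clip_threshold` is a hypothetical helper, and the step count and threshold values are placeholder assumptions, not settings taken from this summary. `clip_by_global_norm` mirrors the usual global-norm clipping rule in plain Python so the effect of the threshold is visible without a training framework.

```python
import math


def clip_threshold(step, warmup_steps=20_000, early_value=100.0,
                   default_value=500.0):
    """Use a tighter clip threshold early in training (placeholder values)."""
    return early_value if step < warmup_steps else default_value


def clip_by_global_norm(grads, max_norm):
    """Scale gradients so their L2 norm does not exceed max_norm.

    Returns the (possibly scaled) gradients and the norm before clipping.
    """
    total = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (total + 1e-6))
    return [g * scale for g in grads], total


# Usage: a large early gradient is scaled down to the early threshold.
grads = [300.0, 400.0]  # global norm 500
clipped, norm_before = clip_by_global_norm(grads, clip_threshold(step=0))
```

With small batches the gradient estimate is noisier, so a tighter threshold early on limits the size of any single update. Once training has stabilized, the threshold can return to the configured default.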

Health Check

  • Last commit: 11 months ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 67 stars in the last 90 days
