BigVGAN by NVIDIA

PyTorch implementation of a universal neural vocoder trained at large scale

Created 3 years ago

1,168 stars

Top 33.1% on SourcePulse

2 Experts Love This Project

codekansas (Cofounder of K-Scale Labs)

jongwook (Research Scientist at OpenAI)

Project Summary

This repository is the official PyTorch implementation of BigVGAN, a universal neural vocoder for high-fidelity audio synthesis. It targets researchers and developers working on speech synthesis (TTS) and audio generation, and reports better audio quality and faster inference than previous models.

How It Works

BigVGAN is trained at large scale with a multi-scale sub-band CQT discriminator and a multi-scale mel spectrogram loss. Its anti-aliased activation runs three steps in sequence (upsampling, activation, downsampling), and a custom CUDA kernel fuses them into a single operation. On an A100 GPU this fused kernel makes inference 1.5-3x faster than the equivalent standard PyTorch operations.
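The upsample, activate, downsample sequence described above can be illustrated with a minimal pure-Python sketch. It is not the repository's code or its fused CUDA kernel. The Snake nonlinearity (as described in the BigVGAN paper), the Hann-windowed sinc filter, the 2x ratio, and all function names here are assumptions chosen for illustration.

```python
import math


def lowpass_taps(num_taps=13, cutoff=0.25):
    """Hann-windowed sinc low-pass FIR taps (illustrative filter choice).

    cutoff is in cycles per sample; taps are normalized to unity DC gain.
    """
    mid = (num_taps - 1) / 2
    taps = []
    for n in range(num_taps):
        t = n - mid
        if t == 0:
            sinc = 2 * cutoff
        else:
            sinc = math.sin(2 * math.pi * cutoff * t) / (math.pi * t)
        window = 0.5 - 0.5 * math.cos(2 * math.pi * (n + 1) / (num_taps + 1))
        taps.append(sinc * window)
    total = sum(taps)
    return [v / total for v in taps]


def fir_same(signal, taps):
    """Filter with zero padding so the output has the same length as the input."""
    half = len(taps) // 2
    padded = [0.0] * half + list(signal) + [0.0] * half
    return [
        sum(taps[k] * padded[i + k] for k in range(len(taps)))
        for i in range(len(signal))
    ]


def snake(x, alpha=1.0):
    """Snake periodic activation: x + sin^2(alpha * x) / alpha (assumed here)."""
    return x + math.sin(alpha * x) ** 2 / alpha


def anti_aliased_activation(signal, activation=snake, ratio=2):
    """Apply a nonlinearity at a higher sample rate to limit aliasing."""
    taps = lowpass_taps(cutoff=0.5 / ratio)
    # 1) Upsample: insert zeros, then low-pass. Scaling by `ratio`
    #    compensates for the energy lost to the inserted zeros.
    stuffed = []
    for v in signal:
        stuffed.append(v * ratio)
        stuffed.extend([0.0] * (ratio - 1))
    upsampled = fir_same(stuffed, taps)
    # 2) Apply the nonlinearity at the higher rate.
    activated = [activation(v) for v in upsampled]
    # 3) Low-pass again, then keep every `ratio`-th sample.
    return fir_same(activated, taps)[::ratio]
```

Written this way, the three steps are separate passes over the data. A fused kernel performs the same steps in one pass, which is where a speedup like the one reported above would come from.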

Quick Start & Requirements

Install: Clone the repository and install dependencies via pip install -r requirements.txt. A Conda environment setup is provided.
Prerequisites: Python 3.10 and PyTorch 2.3.1 built for CUDA 12.1 or 11.8.
Inference: Hugging Face Hub integration simplifies loading pretrained checkpoints and inference. A local Gradio demo is also available.
Links: Paper, Code, Showcase, Project Page, Weights, Demo

Highlighted Details

Offers pretrained checkpoints for various sampling rates (up to 44 kHz) and upsampling ratios (up to 512x).
The custom fused CUDA kernel speeds up inference by 1.5-3x on an A100 GPU.
Trained on diverse datasets including speech, environmental sounds, and instruments.
Achieves state-of-the-art objective metrics (PESQ, M-STFT, MCD) on audio quality benchmarks.
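The upsampling ratio in the checkpoint list above fixes how many waveform samples the vocoder produces per mel frame. The short sketch below works through that arithmetic. The function name is hypothetical, and the example assumes that "44 kHz" means 44,100 Hz.

```python
def vocoder_output_length(num_mel_frames, upsampling_ratio, sampling_rate_hz):
    """Return (number of output samples, duration in seconds).

    Each mel frame is expanded into `upsampling_ratio` waveform samples.
    """
    num_samples = num_mel_frames * upsampling_ratio
    duration_s = num_samples / sampling_rate_hz
    return num_samples, duration_s


# Example: 100 mel frames through a 512x checkpoint at an assumed 44,100 Hz.
samples, seconds = vocoder_output_length(100, 512, 44_100)
print(samples, round(seconds, 3))
```

With these inputs, 100 mel frames expand to 51,200 samples, or roughly 1.16 seconds of audio.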

Maintenance & Community

The most recent listed updates date from July and September 2024.
Integrates with Hugging Face Hub for easy access and demos.

Licensing & Compatibility

The repository does not explicitly state a license in the README. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

Training from scratch with small batch sizes may require adjusting clip_grad_norm to avoid early divergence.
Building the CUDA kernel requires compatible nvcc and PyTorch versions; a failed build most likely points to a mismatch between the two.
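To make the clip_grad_norm caveat above concrete, the sketch below shows what global-norm gradient clipping does: if the combined L2 norm of all gradients exceeds a threshold, every gradient is scaled down by the same factor. It is a pure-Python illustration of the scaling rule used by PyTorch's torch.nn.utils.clip_grad_norm_, not the repository's training code. The function name and the threshold values are assumptions.

```python
import math


def clip_grad_norm(grads, max_norm, eps=1e-6):
    """Scale gradients in place so their global L2 norm is at most max_norm.

    grads: list of per-parameter gradient lists (floats).
    Returns the global norm measured before clipping.
    """
    total_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    # Never scale up: the coefficient is capped at 1.0.
    clip_coef = min(1.0, max_norm / (total_norm + eps))
    for grad in grads:
        for i in range(len(grad)):
            grad[i] *= clip_coef
    return total_norm


# Illustrative values only; these are not the repository's defaults.
grads = [[3.0], [4.0]]                             # global norm is 5.0
norm_before = clip_grad_norm(grads, max_norm=1.0)  # gradients shrink to norm ~1.0
```

A smaller threshold caps the size of each update more tightly, which is why this value is the one to adjust when small-batch training diverges early.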

Health Check

Last Commit: 1 year ago
Responsiveness: Inactive
Pull Requests (30d): 0
Issues (30d): 0

Star History

13 stars in the last 30 days
