PyTorch implementation of UnivNet vocoder for high-fidelity waveform generation
Top 95.2% on sourcepulse
UnivNet provides an unofficial PyTorch implementation of the UnivNet neural vocoder, designed for high-fidelity audio waveform generation. It targets researchers and developers working on text-to-speech (TTS) systems who require a fast and accurate vocoder, claiming superior objective and subjective performance over HiFi-GAN.
How It Works
UnivNet employs a multi-resolution spectrogram discriminator architecture, a key innovation that allows it to capture audio details across different frequency scales. This approach, combined with a GAN framework, enables high-fidelity waveform synthesis. The implementation leverages the same mel-spectrogram calculation as HiFi-GAN for compatibility with popular TTS models like Tacotron2.
Quick Start & Requirements
Install dependencies:

pip install -r requirements.txt

Each line of the training filelist should follow the format:

path_to_wav|transcript|speaker_id

Copy config/default_c32.yaml to config/config.yaml and update the data paths, then launch training:

python trainer.py -c CONFIG_YAML_FILE -n NAME_OF_THE_RUN

Generate waveforms from mel-spectrograms with:

python inference.py -p CHECKPOINT_PATH -i INPUT_MEL_PATH -o OUTPUT_WAV_PATH
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The hop_length parameter for mel-spectrogram calculation is fixed at 256 and cannot be changed. The implementation is unofficial.
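The fixed hop length determines the upsampling factor between mel frames and audio samples: each frame maps to exactly 256 output samples. A small sketch of that arithmetic (the 22.05 kHz sample rate in the comment is an assumption, not confirmed by the repo):

```python
HOP_LENGTH = 256  # fixed in this implementation

def expected_wav_samples(n_mel_frames: int) -> int:
    """Output waveform length for a given number of mel frames."""
    return n_mel_frames * HOP_LENGTH

# A 100-frame mel produces 25600 samples
# (about 1.16 s at an assumed 22.05 kHz sample rate).
print(expected_wav_samples(100))  # 25600
```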