PyTorch implementation of the VITS2 text-to-speech model
This repository provides an unofficial PyTorch implementation of VITS2, a single-stage text-to-speech (TTS) model designed for improved naturalness, speech characteristic similarity, and computational efficiency. It targets researchers and developers seeking to build or fine-tune advanced TTS systems, offering a fully end-to-end approach that reduces reliance on external phoneme conversion.
How It Works
VITS2 enhances its predecessor by incorporating several architectural improvements. It features a transformer block within the normalizing flow for better sequence modeling, a speaker-conditioned text encoder for multi-speaker synthesis, and a duration predictor with adversarial loss and noise-scaled monotonic alignment search for more robust duration modeling. These components contribute to a more natural and efficient speech synthesis process.
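To make the speaker-conditioning idea concrete, below is a minimal, hypothetical PyTorch sketch of a speaker-conditioned text encoder: a learned speaker embedding is broadcast across the token sequence and added to the token representations before a transformer encoder. Class names and dimensions here are illustrative assumptions, not this repository's actual modules.

```python
import torch
import torch.nn as nn

class SpeakerConditionedTextEncoder(nn.Module):
    """Illustrative sketch of speaker conditioning in a text encoder."""

    def __init__(self, n_vocab: int, d_model: int = 192,
                 n_speakers: int = 4, n_layers: int = 4):
        super().__init__()
        self.token_emb = nn.Embedding(n_vocab, d_model)
        self.spk_emb = nn.Embedding(n_speakers, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=2,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, tokens: torch.Tensor,
                speaker_ids: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len) phoneme ids; speaker_ids: (batch,)
        x = self.token_emb(tokens)
        # Broadcast the speaker embedding across the time axis so every
        # token representation carries speaker identity into the encoder.
        x = x + self.spk_emb(speaker_ids).unsqueeze(1)
        return self.encoder(x)

enc = SpeakerConditionedTextEncoder(n_vocab=100)
out = enc(torch.randint(0, 100, (2, 16)), torch.tensor([0, 1]))
print(out.shape)  # torch.Size([2, 16, 192])
```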
Quick Start & Requirements
Install dependencies from requirements.txt; espeak is also required. Training uses train.py for single-speaker models and train_ms.py for multi-speaker models, referencing the provided configuration files. For ONNX export and inference, export_onnx.py and infer_onnx.py are available.
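After exporting with export_onnx.py, the resulting graph can be inspected with onnxruntime to confirm its input and output signatures before building an inference feed. This is a minimal sketch; the model.onnx path is an assumption, so substitute the file your export actually produces.

```python
import onnxruntime as ort

# Hypothetical path: substitute whatever export_onnx.py wrote to disk.
sess = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])

# Print the graph's expected inputs and outputs so an inference feed
# dict can be built with the correct names, shapes, and dtypes.
for inp in sess.get_inputs():
    print("input: ", inp.name, inp.shape, inp.type)
for out in sess.get_outputs():
    print("output:", out.name, out.shape, out.type)
```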
Maintenance & Community
The project is actively maintained by the author, with contributions and discussions welcomed via GitHub issues. The README credits individual contributors for feedback, guidance, and resource support.
Licensing & Compatibility
The repository does not explicitly state a license in the README. Users should verify licensing for commercial or closed-source use.
Limitations & Caveats
The implementation is unofficial and may not perfectly mirror the original VITS2 paper's exact configurations or performance. Some advanced features might still be under development or require expert verification.