vits2_pytorch  by p0p4k

PyTorch implementation of the VITS2 text-to-speech model

Created 2 years ago
539 stars

Top 59.1% on SourcePulse

Project Summary

This repository provides an unofficial PyTorch implementation of VITS2, a single-stage text-to-speech (TTS) model designed for improved naturalness, speech characteristic similarity, and computational efficiency. It targets researchers and developers seeking to build or fine-tune advanced TTS systems, offering a fully end-to-end approach that reduces reliance on external phoneme conversion.

How It Works

VITS2 enhances its predecessor by incorporating several architectural improvements. It features a transformer block within the normalizing flow for better sequence modeling, a speaker-conditioned text encoder for multi-speaker synthesis, and a duration predictor with adversarial loss and noise-scaled monotonic alignment search for more robust duration modeling. These components contribute to a more natural and efficient speech synthesis process.
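To make the alignment step more concrete, here is a minimal pure-Python sketch of monotonic alignment search with optional Gaussian noise added to the scores. It is an illustration under stated assumptions, not the repository's implementation: the function names, the `noise_scale` parameter, and the list-of-lists input layout are all hypothetical, and a real training loop would run a vectorized or compiled version on batched tensors.

```python
import random


def monotonic_alignment_search(log_probs, noise_scale=0.0, seed=None):
    """Find the best monotonic token-to-frame alignment.

    log_probs[i][j] is the log-likelihood of mel frame j under text token i.
    Returns a list that maps each frame j to a token index, non-decreasing,
    starting at token 0 and ending at the last token.
    """
    n_tokens = len(log_probs)
    n_frames = len(log_probs[0])
    if n_frames < n_tokens:
        raise ValueError("need at least one frame per token")

    # Assumption: the noise is zero-mean Gaussian, scaled by noise_scale,
    # and added to every score before the search.
    rng = random.Random(seed)
    scores = [[v + noise_scale * rng.gauss(0.0, 1.0) for v in row]
              for row in log_probs]

    neg_inf = float("-inf")
    # q[i][j] = best cumulative score of a path ending at token i, frame j.
    q = [[neg_inf] * n_frames for _ in range(n_tokens)]
    q[0][0] = scores[0][0]
    for j in range(1, n_frames):
        for i in range(min(j + 1, n_tokens)):
            stay = q[i][j - 1]
            advance = q[i - 1][j - 1] if i > 0 else neg_inf
            best = max(stay, advance)
            if best > neg_inf:
                q[i][j] = best + scores[i][j]

    # Backtrack from the last token at the last frame.
    path = [0] * n_frames
    i = n_tokens - 1
    for j in range(n_frames - 1, -1, -1):
        path[j] = i
        if j > 0 and i > 0 and (i == j or q[i - 1][j - 1] > q[i][j - 1]):
            i -= 1
    return path


def durations_from_path(path, n_tokens):
    """Count how many frames were assigned to each token."""
    durations = [0] * n_tokens
    for token in path:
        durations[token] += 1
    return durations


if __name__ == "__main__":
    toy = [[0.0, 0.0, -5.0, -5.0],
           [-5.0, -5.0, 0.0, 0.0]]
    path = monotonic_alignment_search(toy)
    print(path, durations_from_path(path, len(toy)))
```

With `noise_scale=0.0` this reduces to a plain dynamic-programming search. A nonzero value perturbs the scores so the search can explore alignments other than the current best one.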

Quick Start & Requirements

  • Install: Clone the repository and install requirements from requirements.txt.
  • Prerequisites: Python >= 3.10, PyTorch 1.13.1+, espeak.
  • Data: Download LJSpeech or VCTK datasets and create symbolic links. Preprocessing scripts are provided.
  • Training: Use train.py for single-speaker and train_ms.py for multi-speaker models, referencing configuration files.
  • ONNX Export: Scripts export_onnx.py and infer_onnx.py are available.
  • Links: The repository's Discussion page hosts training logs and community contributions.
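Before installing, it can help to confirm that the listed prerequisites are present. The following is a small sketch using only the Python standard library. The helper names are hypothetical, and it assumes the espeak binary is named `espeak` on your PATH (some systems ship it as `espeak-ng` instead).

```python
import importlib.metadata
import re
import shutil
import sys


def version_tuple(version):
    """Turn a version string such as '1.13.1+cu117' into (1, 13, 1)."""
    core = version.split("+", 1)[0]
    parts = []
    for piece in core.split(".")[:3]:
        match = re.match(r"\d+", piece)
        parts.append(int(match.group()) if match else 0)
    return tuple(parts)


def check_prerequisites():
    """Report whether each prerequisite from the list above is satisfied."""
    try:
        torch_version = importlib.metadata.version("torch")
    except importlib.metadata.PackageNotFoundError:
        torch_version = None
    return {
        "python>=3.10": sys.version_info >= (3, 10),
        "torch>=1.13.1": (torch_version is not None
                          and version_tuple(torch_version) >= (1, 13, 1)),
        # Assumption: the executable is called "espeak".
        "espeak on PATH": shutil.which("espeak") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'ok' if ok else 'MISSING'}")
```

The check only inspects installed package metadata and the PATH, so it does not import PyTorch or run espeak.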

Highlighted Details

  • Implements key VITS2 features: transformer normalizing flow, speaker-conditioned text encoder, and adversarial duration predictor.
  • Supports ONNX export for efficient inference.
  • Includes Gradio demo support.
  • Offers pretrained checkpoints for LJSpeech.

Maintenance & Community

The project is maintained by the author, who welcomes contributions and discussion via GitHub issues, although the most recent commit was about a year ago. The README credits contributors for feedback, guidance, and resource support.

Licensing & Compatibility

The repository does not explicitly state a license in the README. Users should verify licensing for commercial or closed-source use.

Limitations & Caveats

The implementation is unofficial and may not perfectly mirror the original VITS2 paper's exact configurations or performance. Some advanced features might still be under development or require expert verification.

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 5 stars in the last 30 days
