VITS is an end-to-end text-to-speech (TTS) system designed to generate natural-sounding speech with diverse rhythms and pitches, surpassing traditional two-stage TTS models. It targets researchers and developers seeking high-quality, expressive speech synthesis.
How It Works
VITS employs a conditional variational autoencoder (VAE) augmented with normalizing flows and adversarial learning. This combination enhances generative modeling capabilities, allowing for more expressive audio. A stochastic duration predictor is introduced to model the natural one-to-many relationship between text and speech, enabling variations in rhythm and pitch.
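To make the duration-modeling idea concrete, here is a minimal, illustrative PyTorch sketch. It is not the flow-based module VITS actually uses; the class, dimensions, and noise_scale parameter are hypothetical and only show why sampling durations, rather than regressing a single value, lets the same text be spoken with different rhythms.

```python
import torch
import torch.nn as nn

class ToyStochasticDurationPredictor(nn.Module):
    """Illustrative stand-in for VITS's flow-based duration predictor:
    predicts a per-phoneme log-duration distribution (mean, log-variance)
    and samples from it, so repeated calls yield different rhythms."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        # Project each phoneme encoding to [mu, log-variance] of log-duration.
        self.proj = nn.Conv1d(hidden_dim, 2, kernel_size=1)

    def forward(self, text_hidden: torch.Tensor, noise_scale: float = 1.0) -> torch.Tensor:
        # text_hidden: [batch, hidden_dim, text_len] phoneme encodings.
        mu, logvar = self.proj(text_hidden).chunk(2, dim=1)
        eps = torch.randn_like(mu) * noise_scale            # stochasticity
        log_dur = mu + eps * torch.exp(0.5 * logvar)        # reparameterized sample
        return torch.clamp(torch.exp(log_dur), min=1.0)     # durations in frames

# Two samples for the same input give two different rhythms.
enc = torch.randn(1, 192, 20)  # hypothetical text-encoder output
sdp = ToyStochasticDurationPredictor(192)
durations_a, durations_b = sdp(enc), sdp(enc)
```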
Quick Start & Requirements
- Install: Clone the repository and install the Python dependencies from `requirements.txt`. Phonemization may require espeak (`apt-get install espeak`).
- Data: Download the LJ Speech or VCTK dataset and create symbolic links to the audio files. Preprocessing scripts are provided, and preprocessed phoneme transcripts for LJ Speech and VCTK are available.
- Alignment: Build the Monotonic Alignment Search extension (`cd monotonic_align; python setup.py build_ext --inplace`).
- Resources: Requires Python >= 3.6. A CUDA-capable GPU is recommended for training.
- Demo: An interactive TTS demo is available as a Colab notebook; a minimal local inference sketch follows this list.
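With the repository cloned, dependencies installed, and a pretrained single-speaker checkpoint downloaded from the linked models, inference roughly follows the repository's inference notebook. The sketch below assumes the repo root is on the Python path and uses a placeholder checkpoint filename (`pretrained_ljs.pth`); exact module and parameter names should be checked against the notebook.

```python
import torch

# Modules from the VITS repository (run from the repo root).
import commons
import utils
from models import SynthesizerTrn
from text import text_to_sequence
from text.symbols import symbols

def get_text(text, hps):
    # Convert raw text to a phoneme ID sequence, optionally interspersed with blanks.
    text_norm = text_to_sequence(text, hps.data.text_cleaners)
    if hps.data.add_blank:
        text_norm = commons.intersperse(text_norm, 0)
    return torch.LongTensor(text_norm)

# Load the single-speaker (LJ Speech) hyperparameters and model.
hps = utils.get_hparams_from_file("./configs/ljs_base.json")
net_g = SynthesizerTrn(
    len(symbols),
    hps.data.filter_length // 2 + 1,
    hps.train.segment_size // hps.data.hop_length,
    **hps.model).cuda()
net_g.eval()
utils.load_checkpoint("pretrained_ljs.pth", net_g, None)  # placeholder checkpoint path

stn_tst = get_text("VITS is awesome!", hps)
with torch.no_grad():
    x_tst = stn_tst.cuda().unsqueeze(0)
    x_tst_lengths = torch.LongTensor([stn_tst.size(0)]).cuda()
    # noise_scale / noise_scale_w control pitch and rhythm diversity;
    # length_scale controls overall speaking rate.
    audio = net_g.infer(
        x_tst, x_tst_lengths,
        noise_scale=0.667, noise_scale_w=0.8, length_scale=1.0,
    )[0][0, 0].cpu().float().numpy()
```

For the multi-speaker VCTK model, the same flow applies with the `vctk_base` config and an additional speaker-ID tensor passed to `infer` (e.g., `sid=torch.LongTensor([speaker_id]).cuda()`).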
Highlighted Details
- Achieves MOS scores comparable to ground truth on LJ Speech.
- Outperforms the best publicly available two-stage TTS systems in subjective listening tests.
- Supports single-speaker (LJ Speech) and multi-speaker (VCTK) training.
- Features a stochastic duration predictor for rhythmic diversity.
Maintenance & Community
- The project is actively maintained, with contributions noted from Rishikesh for the Colab demo.
- Links to the demo and pretrained models are provided.
Licensing & Compatibility
- The repository does not explicitly state a license in the provided README.
Limitations & Caveats
- Building Monotonic Alignment Search requires manual compilation.
- Exact hardware requirements are not documented; a CUDA-capable GPU is effectively required for practical training.
- The absence of an explicit license may pose compatibility concerns for commercial or closed-source use.