End-to-end text-to-speech via conditional variational autoencoder
Top 6.8% on SourcePulse
VITS is an end-to-end text-to-speech (TTS) system that generates natural-sounding speech with diverse rhythms and pitches. Its authors report that it sounds more natural than conventional two-stage TTS pipelines. It targets researchers and developers seeking high-quality, expressive speech synthesis.
How It Works
VITS employs a conditional variational autoencoder (VAE) augmented with normalizing flows and adversarial learning. This combination enhances generative modeling capabilities, allowing for more expressive audio. A stochastic duration predictor is introduced to model the natural one-to-many relationship between text and speech, enabling variations in rhythm and pitch.
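The one-to-many idea behind the stochastic duration predictor can be illustrated with a toy sketch. This is not the VITS implementation: the real predictor is a learned, flow-based network, whereas the function below simply adds Gaussian noise to hypothetical per-phoneme log-durations. The names `sample_durations`, `noise_scale`, `length_scale`, and the example values are all assumptions made for illustration.

```python
import math
import random


def sample_durations(mean_log_durations, noise_scale=0.8, length_scale=1.0, seed=None):
    """Toy stand-in for a stochastic duration predictor (NOT the VITS model).

    Each phoneme has a mean log-duration; adding random noise before
    exponentiating yields a different rhythm on every draw.
    """
    rng = random.Random(seed)
    durations = []
    for mu in mean_log_durations:
        # Perturb the log-duration; noise_scale=0 removes all randomness.
        log_w = mu + noise_scale * rng.gauss(0.0, 1.0)
        # Convert to a whole number of frames, rounding up.
        durations.append(math.ceil(math.exp(log_w) * length_scale))
    return durations


# Hypothetical mean log-durations for a four-phoneme input.
phoneme_means = [1.0, 1.6, 0.7, 1.2]

take_a = sample_durations(phoneme_means, seed=0)   # one plausible rhythm
take_b = sample_durations(phoneme_means, seed=1)   # another plausible rhythm
deterministic = sample_durations(phoneme_means, noise_scale=0.0)

print(take_a, take_b, deterministic)
```

With `noise_scale=0.0` the same text always maps to the same durations; with noise enabled, repeated draws for identical text give different frame counts, which is the behavior the paragraph above describes.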
Quick Start & Requirements
Install the Python dependencies listed in requirements.txt. May require espeak (apt-get install espeak).
Build the Monotonic Alignment Search extension (cd monotonic_align; python setup.py build_ext --inplace).
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
1 year ago
Inactive