jaywalnut310
End-to-end text-to-speech via conditional variational autoencoder
Top 6.7% on SourcePulse
VITS is an end-to-end text-to-speech (TTS) system designed to generate natural-sounding speech with diverse rhythms and pitches, surpassing traditional two-stage TTS models. It targets researchers and developers seeking high-quality, expressive speech synthesis.
How It Works
VITS employs a conditional variational autoencoder (VAE) augmented with normalizing flows and adversarial learning. This combination enhances generative modeling capabilities, allowing for more expressive audio. A stochastic duration predictor is introduced to model the natural one-to-many relationship between text and speech, enabling variations in rhythm and pitch.
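The one-to-many idea can be illustrated with a toy sketch: the same phoneme sequence yields different timings each time noise is drawn. This is not the VITS implementation. It assumes a simple log-normal model per phoneme, and the function name `sample_durations`, the `noise_scale` parameter, and the example values are all hypothetical, chosen only for illustration.

```python
import math
import random


def sample_durations(log_dur_means, noise_scale, rng):
    """Toy sampler (an assumption, not VITS code): draw one frame count per
    phoneme by perturbing a mean log-duration with Gaussian noise."""
    durations = []
    for mu in log_dur_means:
        # Perturb the log-duration; noise_scale = 0 removes all randomness.
        log_d = mu + noise_scale * rng.gauss(0.0, 1.0)
        # Convert back to a whole number of frames, at least one per phoneme.
        durations.append(max(1, round(math.exp(log_d))))
    return durations


rng = random.Random(0)  # seeded so repeated runs give the same draws
# Hypothetical mean durations (in frames) for three phonemes.
log_means = [math.log(3.0), math.log(5.0), math.log(2.0)]

deterministic = sample_durations(log_means, 0.0, rng)  # always the mean durations
take_a = sample_durations(log_means, 0.8, rng)         # one possible rhythm
take_b = sample_durations(log_means, 0.8, rng)         # another possible rhythm

print(deterministic, take_a, take_b)
```

With `noise_scale` set to zero the sampler collapses to a single fixed timing, which is the behavior of a deterministic duration predictor. Raising it lets the same text map to several plausible rhythms.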
Quick Start & Requirements
Install the Python dependencies listed in requirements.txt. May require espeak (apt-get install espeak).
Build the monotonic_align extension (cd monotonic_align; python setup.py build_ext --inplace).
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
1 year ago
Inactive