vits by jaywalnut310

End-to-end text-to-speech via conditional variational autoencoder

created 4 years ago
7,579 stars

Top 7.0% on sourcepulse

View on GitHub
1 Expert Loves This Project
Project Summary

VITS is an end-to-end text-to-speech (TTS) system designed to generate natural-sounding speech with diverse rhythms and pitches, surpassing traditional two-stage TTS models. It targets researchers and developers seeking high-quality, expressive speech synthesis.

How It Works

VITS combines a conditional variational autoencoder (VAE) with normalizing flows and adversarial training, which together increase the expressiveness of the latent prior and the realism of the decoded waveform. A stochastic duration predictor models the natural one-to-many relationship between text and speech: the same sentence can be spoken with many different rhythms and pitches, so phoneme durations are sampled rather than predicted deterministically.
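As background, training such a conditional VAE maximizes the evidence lower bound (ELBO) on the likelihood of the waveform x given the text condition c, with latent z. This is standard VAE machinery sketched here in generic notation, not a formula quoted from this summary:

    \log p_\theta(x \mid c) \;\ge\; \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] \;-\; D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p_\theta(z \mid c)\big)

The normalizing flow enriches the prior p_\theta(z | c), and adversarial training adds a discriminator loss on the decoded waveform alongside the reconstruction term.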

Quick Start & Requirements

  • Install: Clone the repository and install the Python requirements from requirements.txt. Phonemization may require espeak (apt-get install espeak).
  • Data: Download the LJ Speech and/or VCTK datasets and create symbolic links where the configs expect them (the README uses placeholder paths such as DUMMY1). Preprocessing scripts are provided, and preprocessed phoneme transcripts for both datasets are available for download.
  • Alignment: Build the Cython Monotonic Alignment Search extension (cd monotonic_align; python setup.py build_ext --inplace).
  • Resources: Requires Python >= 3.6. A CUDA-capable GPU is recommended for training.
  • Demo: An interactive TTS demo is available as a Colab notebook; for local inference with a pretrained checkpoint, see the sketch below this list.
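For local inference, the repository's inference.ipynb follows roughly the pattern below. Treat this as a sketch rather than a verbatim excerpt: the modules (utils, commons, models.SynthesizerTrn, text.text_to_sequence) come from the repo, but pretrained_ljs.pth is a placeholder for a downloaded checkpoint and exact arguments may differ between revisions.

    import torch
    import commons
    import utils
    from models import SynthesizerTrn
    from text import text_to_sequence
    from text.symbols import symbols

    # Load hyperparameters for the single-speaker LJ Speech config.
    hps = utils.get_hparams_from_file("configs/ljs_base.json")

    # Build the synthesizer and load a pretrained checkpoint (placeholder path).
    net_g = SynthesizerTrn(
        len(symbols),
        hps.data.filter_length // 2 + 1,
        hps.train.segment_size // hps.data.hop_length,
        **hps.model).cuda()
    net_g.eval()
    utils.load_checkpoint("pretrained_ljs.pth", net_g, None)

    # Convert text to a phoneme-ID sequence, optionally interspersed with blanks.
    seq = text_to_sequence("VITS is awesome!", hps.data.text_cleaners)
    if hps.data.add_blank:
        seq = commons.intersperse(seq, 0)
    x = torch.LongTensor(seq).cuda().unsqueeze(0)
    x_lengths = torch.LongTensor([x.size(1)]).cuda()

    # Synthesize; noise_scale_w drives the stochastic duration predictor.
    with torch.no_grad():
        audio = net_g.infer(x, x_lengths, noise_scale=0.667,
                            noise_scale_w=0.8, length_scale=1.0)[0][0, 0]
    audio = audio.cpu().float().numpy()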

Highlighted Details

  • Achieves MOS (mean opinion score) comparable to ground truth on LJ Speech in the paper's listening tests.
  • Reported to outperform the best publicly available TTS systems at the time of publication.
  • Supports single-speaker (LJ Speech, train.py) and multi-speaker (VCTK, train_ms.py) training.
  • A stochastic duration predictor provides rhythmic diversity at inference time; see the sketch after this list.
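Because durations are sampled rather than deterministic, re-running inference on the same text yields different pacing. Continuing the hypothetical sketch above, the noise_scale_w argument of net_g.infer scales the duration predictor's noise, and length_scale stretches or compresses the overall tempo:

    # Re-sampling with different duration-noise scales varies the rhythm of the
    # same sentence (net_g, x, x_lengths as in the inference sketch above).
    with torch.no_grad():
        for nsw in (0.4, 0.8, 1.2):
            audio = net_g.infer(x, x_lengths, noise_scale=0.667,
                                noise_scale_w=nsw, length_scale=1.0)[0][0, 0]
            # length_scale > 1.0 slows speech; < 1.0 speeds it up.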

Maintenance & Community

  • Development has slowed: the last commit was about a year ago and the project is rated inactive (see Health Check below). Contributions are noted from Rishikesh for the Colab demo.
  • Links to demo and pretrained models are provided.

Licensing & Compatibility

  • The README does not state a license; check the repository itself for a LICENSE file and its terms.

Limitations & Caveats

  • Building Monotonic Alignment Search requires manual compilation.
  • Hardware requirements are not documented, but a CUDA-capable GPU is effectively required for practical training.
  • Licensing terms are not stated in the README; verify the repository's LICENSE file before commercial or closed-source use.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 2
  • Issues (30d): 0

Star History

  • 221 stars in the last 90 days

Explore Similar Projects

Starred by Tim J. Baek (Founder of Open WebUI), Chip Huyen (Author of AI Engineering and Designing Machine Learning Systems), and 3 more.

StyleTTS2 by yl4579

Text-to-speech model achieving human-level synthesis

Top 0.2% · 6k stars · created 2 years ago · updated 11 months ago