VITS is an end-to-end text-to-speech (TTS) system designed to generate natural-sounding speech with diverse rhythms and pitches, surpassing traditional two-stage TTS models. It targets researchers and developers seeking high-quality, expressive speech synthesis.
How It Works
VITS employs a conditional variational autoencoder (VAE) augmented with normalizing flows and adversarial learning. This combination enhances generative modeling capabilities, allowing for more expressive audio. A stochastic duration predictor is introduced to model the natural one-to-many relationship between text and speech, enabling variations in rhythm and pitch.
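To make the duration-modeling idea concrete, here is a minimal, illustrative PyTorch sketch. It is not the flow-based module VITS actually uses; the class, dimensions, and noise_scale parameter are hypothetical and only show why sampling durations, rather than regressing a single value, lets the same text be spoken with different rhythms.

```python
import torch
import torch.nn as nn

class ToyStochasticDurationPredictor(nn.Module):
    """Illustrative stand-in for VITS's flow-based duration predictor:
    predicts a per-phoneme log-duration distribution (mean, log-variance)
    and samples from it, so repeated calls yield different rhythms."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        # Project each phoneme encoding to [mu, log-variance] of log-duration.
        self.proj = nn.Conv1d(hidden_dim, 2, kernel_size=1)

    def forward(self, text_hidden: torch.Tensor, noise_scale: float = 1.0) -> torch.Tensor:
        # text_hidden: [batch, hidden_dim, text_len] phoneme encodings.
        mu, logvar = self.proj(text_hidden).chunk(2, dim=1)
        eps = torch.randn_like(mu) * noise_scale            # stochasticity
        log_dur = mu + eps * torch.exp(0.5 * logvar)        # reparameterized sample
        return torch.clamp(torch.exp(log_dur), min=1.0)     # durations in frames

# Two samples for the same input give two different rhythms.
enc = torch.randn(1, 192, 20)  # hypothetical text-encoder output
sdp = ToyStochasticDurationPredictor(192)
durations_a, durations_b = sdp(enc), sdp(enc)
```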
Quick Start & Requirements
- Install: Clone the repository and install the Python dependencies from `requirements.txt`. Phonemization may require espeak (`apt-get install espeak`).
- Data: Download the LJ Speech or VCTK dataset and create symbolic links to the audio files. Preprocessing scripts are provided, and preprocessed phoneme transcripts for LJ Speech and VCTK are available.
- Alignment: Build the Monotonic Alignment Search extension (`cd monotonic_align; python setup.py build_ext --inplace`).
- Resources: Requires Python >= 3.6. A CUDA-capable GPU is recommended for training.
- Demo: An interactive TTS demo is available as a Colab notebook; a minimal local inference sketch follows this list.
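With the repository cloned, dependencies installed, and a pretrained single-speaker checkpoint downloaded from the linked models, inference roughly follows the repository's inference notebook. The sketch below assumes the repo root is on the Python path and uses a placeholder checkpoint filename (`pretrained_ljs.pth`); exact module and parameter names should be checked against the notebook.

```python
import torch

# Modules from the VITS repository (run from the repo root).
import commons
import utils
from models import SynthesizerTrn
from text import text_to_sequence
from text.symbols import symbols

def get_text(text, hps):
    # Convert raw text to a phoneme ID sequence, optionally interspersed with blanks.
    text_norm = text_to_sequence(text, hps.data.text_cleaners)
    if hps.data.add_blank:
        text_norm = commons.intersperse(text_norm, 0)
    return torch.LongTensor(text_norm)

# Load the single-speaker (LJ Speech) hyperparameters and model.
hps = utils.get_hparams_from_file("./configs/ljs_base.json")
net_g = SynthesizerTrn(
    len(symbols),
    hps.data.filter_length // 2 + 1,
    hps.train.segment_size // hps.data.hop_length,
    **hps.model).cuda()
net_g.eval()
utils.load_checkpoint("pretrained_ljs.pth", net_g, None)  # placeholder checkpoint path

stn_tst = get_text("VITS is awesome!", hps)
with torch.no_grad():
    x_tst = stn_tst.cuda().unsqueeze(0)
    x_tst_lengths = torch.LongTensor([stn_tst.size(0)]).cuda()
    # noise_scale / noise_scale_w control pitch and rhythm diversity;
    # length_scale controls overall speaking rate.
    audio = net_g.infer(
        x_tst, x_tst_lengths,
        noise_scale=0.667, noise_scale_w=0.8, length_scale=1.0,
    )[0][0, 0].cpu().float().numpy()
```

For the multi-speaker VCTK model, the same flow applies with the `vctk_base` config and an additional speaker-ID tensor passed to `infer` (e.g., `sid=torch.LongTensor([speaker_id]).cuda()`).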
Highlighted Details
- Achieves MOS scores comparable to ground truth on LJ Speech.
- Outperforms the best publicly available two-stage TTS systems in subjective listening tests.
- Supports single-speaker (LJ Speech) and multi-speaker (VCTK) training.
- Features a stochastic duration predictor for rhythmic diversity.
Maintenance & Community
- The project is actively maintained, with contributions noted from Rishikesh for the Colab demo.
- Links to the demo and pretrained models are provided.
Licensing & Compatibility
- The repository does not explicitly state a license in the provided README.
Limitations & Caveats
- Building Monotonic Alignment Search requires manual compilation.
- Exact hardware requirements are not documented; a CUDA-capable GPU is effectively required for practical training.
- The absence of an explicit license may pose compatibility concerns for commercial or closed-source use.