vits2 by daniilrobnikov

Unofficial VITS2 implementation for single-stage text-to-speech research

Created 2 years ago
603 stars

Top 54.3% on SourcePulse

Project Summary

VITS2 is a single-stage text-to-speech (TTS) model that aims to improve naturalness and efficiency and to reduce reliance on phoneme conversion compared with prior single-stage approaches. This repository is an unofficial implementation aimed at researchers and developers working on TTS systems who want a more end-to-end solution.

How It Works

VITS2 builds on the VITS architecture with revised model structures and training mechanisms. The key changes are to the Normalizing Flows, the Duration Predictor, and the Text Encoder, which is updated to support speaker conditioning. These changes aim to synthesize more natural speech, improve speaker similarity in multi-speaker models, and make training and inference more efficient, while reducing the strong dependence on phoneme conversion seen in earlier models.
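
To make the Duration Predictor's role concrete: in VITS-style models, the duration predicted for each input token decides how many output frames that token covers. The toy sketch below shows only that expansion step on plain Python lists. The function name expand_by_duration, the example tokens, and the one-frame minimum are illustrative assumptions, not code taken from this repository.

```python
import math


def expand_by_duration(tokens, durations):
    """Repeat each token for ceil(duration) frames.

    The one-frame minimum is an assumption made for this sketch so that
    no token disappears from the output.
    """
    if len(tokens) != len(durations):
        raise ValueError("tokens and durations must have the same length")
    frames = []
    for token, duration in zip(tokens, durations):
        n_frames = max(1, math.ceil(duration))
        frames.extend([token] * n_frames)
    return frames


if __name__ == "__main__":
    # Hypothetical phoneme-like tokens with predicted durations in frames.
    print(expand_by_duration(["h", "e", "l", "o"], [1.2, 3.0, 0.4, 2.5]))
    # -> ['h', 'h', 'e', 'e', 'e', 'l', 'o', 'o', 'o']
```

A real model applies the same idea to batched tensors of hidden states rather than to lists of symbols, so this is best read as a picture of the data flow, not of the implementation.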

Quick Start & Requirements

  • Installation: Clone the repository, create a Conda environment with Python 3.11 and PyTorch 2.0, then install dependencies with pip install -r requirements.txt.
  • Prerequisites: Python 3.11, PyTorch 2.0, and espeak-ng (for phonemization).
  • Setup: Download a dataset (LJ Speech, VCTK, or custom) and preprocess it into mel-spectrograms. Training examples are provided for all three cases.
  • Links: Demo, Paper.
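
Before installing, it can help to confirm that the prerequisites listed above are present. The following is a minimal sketch using only the Python standard library. The function name check_prerequisites is hypothetical, and treating "Python 3.11" as "3.11 or newer" is an assumption. The script only checks that a torch package is importable, not that it is version 2.0.

```python
import importlib.util
import shutil
import sys


def check_prerequisites(min_python=(3, 11)):
    """Return a dict mapping each prerequisite name to True or False."""
    return {
        # Assumption: any interpreter at or above min_python is acceptable.
        "python": sys.version_info >= min_python,
        # True if a top-level 'torch' package can be found (version not checked).
        "torch": importlib.util.find_spec("torch") is not None,
        # True if an 'espeak-ng' executable is on PATH.
        "espeak-ng": shutil.which("espeak-ng") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'ok' if ok else 'missing'}")
```

Running the script prints one line per prerequisite, which makes it easy to see what is still missing before dataset preprocessing or training is attempted.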

Highlighted Details

  • Focuses on improving quality and efficiency of single-stage TTS.
  • Reduces dependence on phoneme conversion for a more end-to-end approach.
  • Revises the Normalizing Flows and Duration Predictor inherited from VITS.
  • Supports multi-speaker TTS with speaker conditioning.

Maintenance & Community

This is an unofficial implementation. The project is a work in progress with a TODO list indicating planned features and improvements.

Licensing & Compatibility

The README does not state a license, so the terms for commercial use or closed-source linking are unspecified.

Limitations & Caveats

The implementation is unofficial and unfinished: several features are still listed as "In progress" or "TODO" on the project's roadmap, and the last commit was 2 years ago.

Health Check

  • Last commit: 2 years ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 8 stars in the last 30 days
