VITS pipeline for fast speaker adaptation TTS and voice conversion
This repository provides a fast fine-tuning pipeline for the VITS Text-to-Speech (TTS) model, enabling rapid speaker adaptation for both TTS synthesis and many-to-many voice conversion. It targets users who want to quickly integrate custom voices into existing VITS models, supporting cloning from short or long audio, and even video sources.
How It Works
The project builds on VITS (Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech) and focuses on efficient fine-tuning. Users adapt a pre-trained model with their own voice data, after which the model can synthesize speech in the new voice or perform voice conversion between any supported speakers. The approach prioritizes speed and ease of use for speaker cloning.
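A core step inside VITS training is monotonic alignment search (MAS), the dynamic-programming procedure (inherited from Glow-TTS) that the `monotonic_align` module accelerates. As an illustration only, not code from this repository, here is a minimal NumPy sketch of MAS: it finds the monotonic text-to-frame path that maximizes total log-likelihood.

```python
import numpy as np

def monotonic_alignment_search(log_p):
    """Hard monotonic alignment via dynamic programming.

    log_p: array of shape [T_text, T_mel] with per-(token, frame)
    log-likelihoods. Assumes T_mel >= T_text so a valid path exists.
    Returns a 0/1 matrix of the same shape: each mel frame is assigned
    to exactly one text token, indices never decrease, and the path
    starts at token 0 and ends at the last token.
    """
    T_text, T_mel = log_p.shape
    # Q[j, i] = best total log-likelihood of any monotonic path that
    # ends with frame i assigned to token j.
    Q = np.full((T_text, T_mel), -np.inf)
    Q[0, 0] = log_p[0, 0]
    for i in range(1, T_mel):
        for j in range(T_text):
            stay = Q[j, i - 1]                      # repeat current token
            move = Q[j - 1, i - 1] if j > 0 else -np.inf  # advance one token
            Q[j, i] = log_p[j, i] + max(stay, move)
    # Backtrack from the forced endpoint (last token, last frame).
    A = np.zeros((T_text, T_mel), dtype=np.int64)
    j = T_text - 1
    for i in range(T_mel - 1, -1, -1):
        A[j, i] = 1
        if i > 0 and j > 0 and Q[j - 1, i - 1] > Q[j, i - 1]:
            j -= 1
    return A

# Toy example: 3 text tokens, 5 mel frames, with an obvious best path.
log_p = np.log(np.array([
    [0.90, 0.90, 0.10, 0.10, 0.10],
    [0.05, 0.05, 0.80, 0.10, 0.10],
    [0.05, 0.05, 0.10, 0.80, 0.80],
]))
alignment = monotonic_alignment_search(log_p)
```

In the repository itself this search is compiled as a Cython extension for speed; the pure-Python version above exists only to show the algorithm.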
Quick Start & Requirements
Install the dependencies with `pip install -r requirements.txt` and build the `monotonic_align` module. Google Colab is also supported.
Highlighted Details
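The setup above can be sketched as a shell session. The `build_ext --inplace` step follows the upstream VITS repository's convention for compiling the Cython `monotonic_align` extension; the exact paths are assumptions about this repo's layout.

```shell
# Assumed layout, following upstream VITS: requirements.txt at the
# repo root and a Cython extension under monotonic_align/.
pip install -r requirements.txt

# Compile the monotonic alignment extension in place.
cd monotonic_align
python setup.py build_ext --inplace
cd ..
```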
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
Inference is currently supported only on Windows. The README does not specify a license, which may affect commercial use.
Last updated 6 months ago; the repository is currently marked inactive.