VITS pipeline for fast speaker adaptation TTS and voice conversion
This repository provides a fast fine-tuning pipeline for the VITS Text-to-Speech (TTS) model, enabling rapid speaker adaptation for both TTS synthesis and many-to-many voice conversion. It targets users who want to quickly integrate custom voices into existing VITS models, supporting cloning from short or long audio, and even video sources.
How It Works
The project builds on VITS (Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech) and focuses on efficient fine-tuning. Users adapt a pre-trained model with their own voice data, after which the model can synthesize speech in the new voice or perform voice conversion between any supported speakers. The approach prioritizes speed and ease of use for speaker cloning.
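A core step inside VITS training is monotonic alignment search (MAS), the dynamic-programming procedure (inherited from Glow-TTS) that the `monotonic_align` module accelerates. As an illustration only, not code from this repository, here is a minimal NumPy sketch of MAS: it finds the monotonic text-to-frame path that maximizes total log-likelihood.

```python
import numpy as np

def monotonic_alignment_search(log_p):
    """Hard monotonic alignment via dynamic programming.

    log_p: array of shape [T_text, T_mel] with per-(token, frame)
    log-likelihoods. Assumes T_mel >= T_text so a valid path exists.
    Returns a 0/1 matrix of the same shape: each mel frame is assigned
    to exactly one text token, indices never decrease, and the path
    starts at token 0 and ends at the last token.
    """
    T_text, T_mel = log_p.shape
    # Q[j, i] = best total log-likelihood of any monotonic path that
    # ends with frame i assigned to token j.
    Q = np.full((T_text, T_mel), -np.inf)
    Q[0, 0] = log_p[0, 0]
    for i in range(1, T_mel):
        for j in range(T_text):
            stay = Q[j, i - 1]                      # repeat current token
            move = Q[j - 1, i - 1] if j > 0 else -np.inf  # advance one token
            Q[j, i] = log_p[j, i] + max(stay, move)
    # Backtrack from the forced endpoint (last token, last frame).
    A = np.zeros((T_text, T_mel), dtype=np.int64)
    j = T_text - 1
    for i in range(T_mel - 1, -1, -1):
        A[j, i] = 1
        if i > 0 and j > 0 and Q[j - 1, i - 1] > Q[j, i - 1]:
            j -= 1
    return A

# Toy example: 3 text tokens, 5 mel frames, with an obvious best path.
log_p = np.log(np.array([
    [0.90, 0.90, 0.10, 0.10, 0.10],
    [0.05, 0.05, 0.80, 0.10, 0.10],
    [0.05, 0.05, 0.10, 0.80, 0.80],
]))
alignment = monotonic_alignment_search(log_p)
```

In the repository itself this search is compiled as a Cython extension for speed; the pure-Python version above exists only to show the algorithm.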
Quick Start & Requirements
Install the dependencies with `pip install -r requirements.txt` and build the `monotonic_align` module. Google Colab is also supported.
Highlighted Details
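The setup above can be sketched as a shell session. The `build_ext --inplace` step follows the upstream VITS repository's convention for compiling the Cython `monotonic_align` extension; the exact paths are assumptions about this repo's layout.

```shell
# Assumed layout, following upstream VITS: requirements.txt at the
# repo root and a Cython extension under monotonic_align/.
pip install -r requirements.txt

# Compile the monotonic alignment extension in place.
cd monotonic_align
python setup.py build_ext --inplace
cd ..
```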
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
Inference is currently supported only on Windows. The README does not specify a license, which may affect commercial use.
Last updated 6 months ago; the repository is currently marked inactive.