GPT-SoVITS by RVC-Boss

Few-shot voice cloning and TTS web UI

Created 2 years ago

53,935 stars

Top 0.5% on SourcePulse


2 Experts Love This Project

Georgios Konstantopoulos

CTO, General Partner at Paradigm

Chip Huyen

Author of "AI Engineering", "Designing Machine Learning Systems"

Project Summary

GPT-SoVITS provides a web UI for few-shot voice cloning and text-to-speech (TTS) synthesis, letting users train a high-quality TTS model with as little as one minute of voice data. It targets researchers, developers, and hobbyists interested in realistic, efficient voice synthesis, and offers cross-lingual inference plus integrated tools for dataset preparation.

How It Works

GPT-SoVITS combines several models, including SoVITS and GPT and possibly others such as HuBERT, to achieve few-shot voice cloning. The architecture supports zero-shot inference from a very short audio sample (5 seconds) and fine-tuning on minimal data (1 minute) for improved accuracy and realism. Cross-lingual support is a key advantage: inference works in languages different from the language of the training data.
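To make the two data thresholds concrete, the sketch below measures a WAV file's duration and suggests which workflow it could suit. It is not part of GPT-SoVITS: `wav_duration_seconds`, `suggest_mode`, and the mode labels are hypothetical names, and treating the 5-second and 1-minute figures as hard lower bounds is an assumption made only for illustration.

```python
import wave

# Figures taken from the text above; using them as strict cut-offs is an assumption.
ZERO_SHOT_MIN_SECONDS = 5.0
FINE_TUNE_MIN_SECONDS = 60.0


def wav_duration_seconds(path):
    """Return the duration of a WAV file in seconds (standard-library only)."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def suggest_mode(duration_seconds):
    """Map a clip duration to a hypothetical workflow label."""
    if duration_seconds >= FINE_TUNE_MIN_SECONDS:
        return "fine-tune"
    if duration_seconds >= ZERO_SHOT_MIN_SECONDS:
        return "zero-shot"
    return "too-short"
```

A call such as `suggest_mode(wav_duration_seconds("reference.wav"))` would classify a clip, where `reference.wav` is a placeholder file name.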

Quick Start & Requirements

Installation: Supports Windows (integrated package), Linux (conda/Docker), and macOS.
Prerequisites: Python 3.9+ and PyTorch (specific versions tested with CUDA 11.8/12.3/12.4, Apple silicon, and CPU). FFmpeg is required. Visual Studio 2017 is needed for Korean TTS, and Faster Whisper models are needed for English/Japanese ASR.
Resources: Requires downloading pretrained models. Docker setup is available.
Links: Colab Demo, Huggingface Demo, Discord.
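Two of the prerequisites listed above, the Python version and FFmpeg, can be checked before installation. The following is a minimal sketch using only the standard library; `check_prerequisites` is a hypothetical helper, not a script shipped with the project, and it assumes the FFmpeg executable is named `ffmpeg` and is expected on `PATH`.

```python
import shutil
import sys

MIN_PYTHON = (3, 9)  # "Python 3.9+" from the prerequisites above


def check_prerequisites(version_info=sys.version_info, which=shutil.which):
    """Return a list of problems found; an empty list means both checks passed."""
    problems = []
    if tuple(version_info[:2]) < MIN_PYTHON:
        problems.append(
            f"Python {version_info[0]}.{version_info[1]} found; 3.9+ required"
        )
    # Assumption: FFmpeg is installed as an executable called "ffmpeg" on PATH.
    if which("ffmpeg") is None:
        problems.append("ffmpeg not found on PATH")
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print("MISSING:", problem)
```

The `version_info` and `which` parameters exist only so the check can be exercised without changing the real environment. The sketch does not verify PyTorch, CUDA, or the pretrained models.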

Highlighted Details

Zero-shot TTS with 5-second vocal samples.
Few-shot TTS fine-tuning with 1 minute of data.
Cross-lingual inference (English, Japanese, Korean, Cantonese, Chinese).
Integrated tools: voice separation, auto-segmentation, ASR, text labeling.

Maintenance & Community

The project has an active Discord community and appears to be under continuous development with multiple versions released (v2, v3, v4) introducing significant feature improvements.

Licensing & Compatibility

License: MIT.
Compatibility: Permissive for commercial use and integration with closed-source projects.

Limitations & Caveats

Models trained on Apple silicon GPUs may be of lower quality than those trained on other devices. Some advanced features, such as specific ASR models, are limited to Chinese. The README notes that v4 is a direct replacement for v3 but requires further testing.

Health Check

Last Commit

1 week ago

Responsiveness

1 day

Star History

1,075 stars in the last 30 days