GPT-SoVITS by RVC-Boss

Few-shot voice cloning and TTS web UI

Created 1 year ago
50,946 stars

Top 0.5% on SourcePulse

Project Summary

GPT-SoVITS provides a WebUI for few-shot voice cloning and text-to-speech (TTS) synthesis. Users can train a TTS model with as little as one minute of voice data. The project targets researchers, developers, and hobbyists interested in realistic, efficient voice synthesis, and it offers cross-lingual inference and integrated tools for dataset preparation.

How It Works

GPT-SoVITS combines several models, including SoVITS and GPT, and potentially others such as HuBERT, to clone a voice from very little data. It supports zero-shot inference from a very short audio sample (5 seconds) and fine-tuning on minimal data (1 minute) for improved accuracy and realism. Cross-lingual support is a key advantage: inference can run in languages different from those in the training data.

Quick Start & Requirements

  • Installation: Supports Windows (integrated package), Linux (conda/Docker), and macOS.
  • Prerequisites: Python 3.9+ and PyTorch (specific versions tested with CUDA 11.8/12.3/12.4, Apple silicon, and CPU). FFmpeg is required. Korean TTS requires Visual Studio 2017. English/Japanese ASR uses Faster Whisper models.
  • Resources: Pretrained models must be downloaded. A Docker setup is available.
  • Links: Colab Demo, Huggingface Demo, Discord.
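
The prerequisites above can be checked before installing anything project-specific. The following is a minimal sketch, not part of GPT-SoVITS: the helper names (`python_ok`, `ffmpeg_available`, `pick_torch_device`, `check_environment`) are hypothetical, and the only facts taken from this summary are the Python 3.9+ floor, the FFmpeg requirement, and the CUDA / Apple silicon / CPU backends. It does not verify specific CUDA or PyTorch versions.

```python
import shutil
import sys

MIN_PYTHON = (3, 9)  # minimum version listed in the prerequisites


def python_ok(version_info=sys.version_info, minimum=MIN_PYTHON):
    """Return True if the interpreter meets the minimum (major, minor) version."""
    return tuple(version_info[:2]) >= minimum


def ffmpeg_available():
    """Return True if an `ffmpeg` executable is found on PATH."""
    return shutil.which("ffmpeg") is not None


def pick_torch_device():
    """Return 'cuda', 'mps', or 'cpu'; return None if PyTorch is not installed."""
    try:
        import torch
    except ImportError:
        return None
    if torch.cuda.is_available():
        return "cuda"
    # Apple silicon GPUs are exposed through the MPS backend in recent PyTorch.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


def check_environment():
    """Collect the three checks into one dictionary."""
    return {
        "python_ok": python_ok(),
        "ffmpeg": ffmpeg_available(),
        "torch_device": pick_torch_device(),
    }


if __name__ == "__main__":
    for name, value in check_environment().items():
        print(f"{name}: {value}")
```

A result of `torch_device: None` only means PyTorch could not be imported in the current environment; it says nothing about whether the hardware is supported.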

Highlighted Details

  • Zero-shot TTS with 5-second vocal samples.
  • Few-shot TTS fine-tuning with 1 minute of data.
  • Cross-lingual inference (English, Japanese, Korean, Cantonese, Chinese).
  • Integrated tools: voice separation, auto-segmentation, ASR, text labeling.
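
To make the 5-second and 1-minute figures concrete, here is a small sketch that measures a WAV file with Python's standard `wave` module and suggests which mode the amount of audio would suit. It is illustrative only: `wav_duration_seconds` and `suggest_mode` are hypothetical helpers rather than GPT-SoVITS functions, and treating the two figures as lower bounds is an assumption of this sketch, not something the project states.

```python
import wave

ZERO_SHOT_MIN_SECONDS = 5.0   # "5-second vocal samples" (assumed lower bound)
FEW_SHOT_MIN_SECONDS = 60.0   # "1 minute of data" (assumed lower bound)


def wav_duration_seconds(path):
    """Return the duration of an uncompressed WAV file in seconds."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def suggest_mode(total_seconds):
    """Map an amount of audio to the mode it would suit under the assumed bounds."""
    if total_seconds >= FEW_SHOT_MIN_SECONDS:
        return "few-shot fine-tuning"
    if total_seconds >= ZERO_SHOT_MIN_SECONDS:
        return "zero-shot"
    return "too short"
```

For example, `suggest_mode(wav_duration_seconds("reference.wav"))` would classify a single clip, where `reference.wav` is a placeholder path; for fine-tuning, the durations of all clips in a dataset would be summed first.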

Maintenance & Community

The project has an active Discord community and appears to be under continuous development: multiple versions (v2, v3, v4) have been released, each introducing significant feature improvements.

Licensing & Compatibility

  • License: MIT.
  • Compatibility: The license permits commercial use and integration into closed-source projects.

Limitations & Caveats

Models trained on Apple silicon GPUs may be of lower quality than models trained on other devices. Some advanced features, such as specific ASR models, are limited to Chinese. The README notes that v4 is a direct replacement for v3 but requires further testing.

Health Check

  • Last commit: 1 week ago
  • Responsiveness: 1 day
  • Pull requests (30d): 5
  • Issues (30d): 23
  • Star history: 892 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Pietro Schirano (Founder of MagicPath), and 2 more.

metavoice-src by metavoiceio

0.1%
4k
TTS model for human-like, expressive speech
Created 1 year ago
Updated 1 year ago

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Chaoyu Yang (Founder of Bento), and 1 more.

fish-speech by fishaudio

0.3%
23k
Open-source TTS for multilingual speech synthesis
Created 1 year ago
Updated 1 week ago

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Junyang Lin (Core Maintainer at Alibaba Qwen), and 6 more.

OpenVoice by myshell-ai

0.2%
34k
Audio foundation model for versatile, instant voice cloning
Created 1 year ago
Updated 5 months ago