GPT-SoVITS by RVC-Boss

Few-shot voice cloning and TTS web UI

created 1 year ago
49,442 stars

Top 0.5% on sourcepulse

Project Summary

This project provides a powerful WebUI for few-shot voice cloning and text-to-speech (TTS) synthesis, enabling users to train high-quality TTS models with as little as one minute of voice data. It targets researchers, developers, and hobbyists interested in realistic and efficient voice synthesis, offering cross-lingual capabilities and integrated tools for dataset preparation.

How It Works

GPT-SoVITS combines a GPT-style model with SoVITS, alongside supporting components such as HuBERT, to achieve few-shot voice cloning. The architecture supports two modes: zero-shot inference from a very short reference sample (about 5 seconds), and fine-tuning on a small dataset (about 1 minute) for better accuracy and realism. Cross-lingual support is a key advantage: the model can run inference in languages different from those in the training data.
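The two data thresholds above can be illustrated with a small helper that measures a WAV file and suggests which mode it could support. This is a minimal sketch using only the Python standard library, not code from GPT-SoVITS: the function names `wav_duration_seconds` and `suggest_mode` are hypothetical, and the 5-second and 60-second constants are taken from this summary rather than from limits the project is known to enforce.

```python
import wave

# Thresholds quoted in this summary (assumptions, not values read from GPT-SoVITS).
ZERO_SHOT_MIN_SECONDS = 5.0   # about 5 seconds for a zero-shot reference sample
FEW_SHOT_MIN_SECONDS = 60.0   # about 1 minute of data for few-shot fine-tuning


def wav_duration_seconds(path):
    """Return the duration of an uncompressed WAV file in seconds."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def suggest_mode(total_seconds):
    """Map an amount of audio to the mode it could plausibly support."""
    if total_seconds >= FEW_SHOT_MIN_SECONDS:
        return "few-shot fine-tuning"
    if total_seconds >= ZERO_SHOT_MIN_SECONDS:
        return "zero-shot inference"
    return "too short"
```

For a dataset made of several clips, summing `wav_duration_seconds` over the files and passing the total to `suggest_mode` gives a rough first check before any training is attempted.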

Quick Start & Requirements

  • Installation: Supports Windows (integrated package), Linux (conda/Docker), and macOS.
  • Prerequisites: Python 3.9+ and PyTorch (specific versions tested with CUDA 11.8/12.3/12.4, Apple silicon, and CPU). FFmpeg is required. Visual Studio 2017 is needed for Korean TTS, and Faster Whisper models are used for English/Japanese ASR.
  • Resources: Requires downloading pretrained models. Docker setup is available.
  • Links: Colab Demo, Huggingface Demo, Discord.
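Before installing, it can help to confirm that the basic prerequisites listed above are present. The following is a minimal sketch using only the Python standard library; `check_environment` is a hypothetical helper, not part of GPT-SoVITS, and it only checks the interpreter version, whether an `ffmpeg` executable is on the PATH, and whether a `torch` package is importable. It does not verify CUDA versions or pretrained model downloads.

```python
import importlib.util
import shutil
import sys

# Minimum Python version quoted in this summary.
MIN_PYTHON = (3, 9)


def check_environment():
    """Return a dict of basic prerequisite checks (True means satisfied)."""
    return {
        # Compare the running interpreter against the minimum version.
        "python_ok": sys.version_info >= MIN_PYTHON,
        # shutil.which returns None when the executable is not on the PATH.
        "ffmpeg_on_path": shutil.which("ffmpeg") is not None,
        # find_spec returns None when the package cannot be located.
        "torch_installed": importlib.util.find_spec("torch") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'ok' if ok else 'missing'}")
```

Any entry reported as missing points to a prerequisite to install before following the project's own setup instructions.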

Highlighted Details

  • Zero-shot TTS with 5-second vocal samples.
  • Few-shot TTS fine-tuning with 1 minute of data.
  • Cross-lingual inference (English, Japanese, Korean, Cantonese, Chinese).
  • Integrated tools: voice separation, auto-segmentation, ASR, text labeling.

Maintenance & Community

The project has an active Discord community and appears to be under continuous development with multiple versions released (v2, v3, v4) introducing significant feature improvements.

Licensing & Compatibility

  • License: MIT.
  • Compatibility: Permissive for commercial use and integration with closed-source projects.

Limitations & Caveats

Models trained on Apple silicon GPUs may be lower in quality than models trained on other devices. Some advanced features, such as specific ASR models, are limited to Chinese. The README notes that V4 is a direct replacement for V3 but requires further testing.

Health Check

  • Last commit: 2 weeks ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 8
  • Issues (30d): 198

Star History

4,000 stars in the last 90 days

Explore Similar Projects

Starred by Michael Han (Cofounder of Unsloth), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 7 more.

TTS by coqui-ai

Deep learning toolkit for Text-to-Speech, research-tested

  • Top 0.4%
  • 42k stars
  • created 5 years ago
  • updated 11 months ago