TTS model for VTuber voice cloning using PaddleSpeech
This project provides a Text-to-Speech (TTS) system trained on VTuber voices, enabling users to generate speech from text in a specific VTuber's voice. It is built upon the PaddleSpeech framework and targets users interested in creating custom voiceovers or AI-generated content with unique vocal characteristics.
How It Works
The system uses FastSpeech2 or SpeedySpeech for acoustic modeling and a Parallel WaveGAN (PWG) or HiFiGAN vocoder for waveform generation. The core workflow is an extensive data-preprocessing pipeline: audio extraction, background-music removal with Spleeter, audio segmentation, Automatic Speech Recognition (ASR) to produce transcripts, and Montreal Forced Aligner (MFA) for phoneme alignment. This pipeline exists to ensure tight alignment between audio and text, which is crucial for accurate TTS synthesis; a sketch of the steps follows.
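The following is a minimal sketch of that pipeline assembled from the stock ffmpeg, Spleeter, PaddleSpeech, and MFA command-line tools. File names, the 24 kHz sample rate, the fixed 10-second segmentation, and the MFA dictionary/acoustic-model paths are illustrative assumptions, not the project's actual scripts.

```
# 1. Extract mono audio from a source video (24 kHz is a common
#    FastSpeech2 sample rate; an assumption here).
ffmpeg -i stream.mp4 -vn -ac 1 -ar 24000 raw_audio.wav

# 2. Separate vocals from background music with Spleeter's 2-stem model;
#    the vocal track lands in separated/raw_audio/vocals.wav.
spleeter separate -p spleeter:2stems -o separated/ raw_audio.wav

# 3. Split the vocal track into fixed 10 s clips. The project applies its
#    own segmentation step; this ffmpeg segment muxer is a generic stand-in.
mkdir -p clips
ffmpeg -i separated/raw_audio/vocals.wav -f segment -segment_time 10 clips/clip_%04d.wav

# 4. Transcribe a clip with PaddleSpeech ASR (prints the transcript),
#    yielding the text side of each training pair.
paddlespeech asr --lang zh --input clips/clip_0000.wav

# 5. Force-align phonemes to audio with MFA. corpus/ holds each .wav next
#    to a matching .lab transcript; the dictionary and acoustic model
#    file names are placeholders.
mfa align corpus/ mandarin_dict.txt mandarin_acoustic.zip aligned/

# 6. After training, synthesis pairs an acoustic model with a vocoder.
#    Shown with PaddleSpeech's stock pretrained models (not the cloned
#    VTuber voice) to illustrate the am/voc pairing:
paddlespeech tts --am fastspeech2_csmsc --voc hifigan_csmsc \
    --lang zh --input "你好，世界" --output out.wav
```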
Quick Start & Requirements
Set up a Python environment with conda. Install dependencies via pip install -r requirements.txt (GPU) or requirements_cpu.txt (CPU). Additional requirements: ffmpeg (install via a package manager or download it directly), plus PyQt5 and sounddevice for the GUI. A GPU with CUDA is recommended for training.
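Concretely, setup might look like the following; the conda environment name, Python version, and apt command are assumptions, while the two requirements files are the ones named above.

```
# Create an isolated environment (name and Python version are assumptions).
conda create -n vtuber-tts python=3.8 -y
conda activate vtuber-tts

# GPU machine:
pip install -r requirements.txt
# CPU-only alternative:
# pip install -r requirements_cpu.txt

# ffmpeg from the system package manager (Debian/Ubuntu shown).
sudo apt install ffmpeg
```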
Maintenance & Community
This appears to be a personal project; a TODO list indicates planned work, but the repository has been inactive for roughly two years. No community channels (Discord, Slack) or notable contributors are mentioned in the README.
Licensing & Compatibility
The README does not state a license. The project depends on PaddleSpeech, which is Apache-2.0 licensed, but whether this project itself permits commercial use or closed-source linking is unspecified.
Limitations & Caveats
Training requires significant GPU resources and time; CPU training is not practical. No pre-trained models or comprehensive tutorials are provided, so users must work through the detailed, multi-step data preparation and training process themselves. The ASR step is noted as slow due to batch-size limitations.