TTS model for VTuber voice cloning using PaddleSpeech
This project provides a Text-to-Speech (TTS) system trained on VTuber voices, enabling users to generate speech from text in a specific VTuber's voice. It is built upon the PaddleSpeech framework and targets users interested in creating custom voiceovers or AI-generated content with unique vocal characteristics.
How It Works
The system uses FastSpeech2 or SpeedySpeech for acoustic modeling and a Parallel WaveGAN (PWG) or HiFiGAN vocoder for waveform generation. The core workflow is an extensive data-preprocessing pipeline: audio extraction, background-music removal with Spleeter, audio segmentation, Automatic Speech Recognition (ASR) to produce transcripts, and Montreal Forced Aligner (MFA) for phoneme alignment. This pipeline exists to ensure tight alignment between audio and text, which is crucial for accurate TTS synthesis; a sketch of the steps follows.
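The following is a minimal sketch of that pipeline assembled from the stock ffmpeg, Spleeter, PaddleSpeech, and MFA command-line tools. File names, the 24 kHz sample rate, the fixed 10-second segmentation, and the MFA dictionary/acoustic-model paths are illustrative assumptions, not the project's actual scripts.

```
# 1. Extract mono audio from a source video (24 kHz is a common
#    FastSpeech2 sample rate; an assumption here).
ffmpeg -i stream.mp4 -vn -ac 1 -ar 24000 raw_audio.wav

# 2. Separate vocals from background music with Spleeter's 2-stem model;
#    the vocal track lands in separated/raw_audio/vocals.wav.
spleeter separate -p spleeter:2stems -o separated/ raw_audio.wav

# 3. Split the vocal track into fixed 10 s clips. The project applies its
#    own segmentation step; this ffmpeg segment muxer is a generic stand-in.
mkdir -p clips
ffmpeg -i separated/raw_audio/vocals.wav -f segment -segment_time 10 clips/clip_%04d.wav

# 4. Transcribe a clip with PaddleSpeech ASR (prints the transcript),
#    yielding the text side of each training pair.
paddlespeech asr --lang zh --input clips/clip_0000.wav

# 5. Force-align phonemes to audio with MFA. corpus/ holds each .wav next
#    to a matching .lab transcript; the dictionary and acoustic model
#    file names are placeholders.
mfa align corpus/ mandarin_dict.txt mandarin_acoustic.zip aligned/

# 6. After training, synthesis pairs an acoustic model with a vocoder.
#    Shown with PaddleSpeech's stock pretrained models (not the cloned
#    VTuber voice) to illustrate the am/voc pairing:
paddlespeech tts --am fastspeech2_csmsc --voc hifigan_csmsc \
    --lang zh --input "你好，世界" --output out.wav
```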
Quick Start & Requirements
Set up a Python environment with conda. Install dependencies via pip install -r requirements.txt (GPU) or requirements_cpu.txt (CPU). Additional requirements: ffmpeg (install via a package manager or download it directly), plus PyQt5 and sounddevice for the GUI. A GPU with CUDA is recommended for training.
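Concretely, setup might look like the following; the conda environment name, Python version, and apt command are assumptions, while the two requirements files are the ones named above.

```
# Create an isolated environment (name and Python version are assumptions).
conda create -n vtuber-tts python=3.8 -y
conda activate vtuber-tts

# GPU machine:
pip install -r requirements.txt
# CPU-only alternative:
# pip install -r requirements_cpu.txt

# ffmpeg from the system package manager (Debian/Ubuntu shown).
sudo apt install ffmpeg
```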
Maintenance & Community
This appears to be a personal project; a TODO list indicates planned work, but the repository has been inactive for roughly two years. No community channels (Discord, Slack) or notable contributors are mentioned in the README.
Licensing & Compatibility
The README does not state a license. The project depends on PaddleSpeech, which is Apache-2.0 licensed, but whether this project itself permits commercial use or closed-source linking is unspecified.
Limitations & Caveats
Training requires significant GPU resources and time; CPU training is not practical. No pre-trained models or comprehensive tutorials are provided, so users must work through the detailed, multi-step data preparation and training process themselves. The ASR step is noted as slow due to batch-size limitations.