VTuberTalk  by jerryuhoo

TTS model for VTuber voice cloning using PaddleSpeech

created 3 years ago
378 stars

Top 76.3% on sourcepulse

Project Summary

This project provides a Text-to-Speech (TTS) system trained on VTuber voices, enabling users to generate speech from text in a specific VTuber's voice. It is built upon the PaddleSpeech framework and targets users interested in creating custom voiceovers or AI-generated content with unique vocal characteristics.

How It Works

The system uses FastSpeech2 or SpeedySpeech for acoustic modeling and a vocoder (Parallel WaveGAN, abbreviated PWG, or HiFiGAN) for waveform generation. The core workflow is a long preprocessing pipeline: audio extraction, noise reduction (Spleeter), audio segmentation, Automatic Speech Recognition (ASR) to generate transcripts, and Montreal Forced Aligner (MFA) for phoneme alignment. The pipeline aims for accurate alignment between audio and text, which the quality of the synthesized speech depends on.
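To make the segmentation step more concrete, here is a minimal sketch of energy-based splitting on silence. It is an illustration only: the summary does not say how the project segments audio, so the function name split_on_silence, the threshold, and the frame sizes are all assumptions rather than the project's actual method.

```python
from typing import List, Optional, Tuple


def split_on_silence(
    samples: List[float],
    sample_rate: int,
    frame_ms: int = 20,           # assumed frame size, not taken from the project
    threshold: float = 0.01,      # assumed energy threshold for "voiced"
    min_silence_frames: int = 5,  # assumed pause length that ends a segment
) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices of voiced segments."""
    frame_len = max(1, sample_rate * frame_ms // 1000)
    segments: List[Tuple[int, int]] = []
    start: Optional[int] = None  # first sample of the segment being built
    voiced_end = 0               # one past the last voiced sample seen
    silent_run = 0               # consecutive silent frames inside a segment

    for offset in range(0, len(samples), frame_len):
        frame = samples[offset:offset + frame_len]
        # Mean absolute amplitude serves as a cheap energy measure.
        energy = sum(abs(x) for x in frame) / len(frame)
        if energy >= threshold:
            if start is None:
                start = offset
            voiced_end = offset + len(frame)
            silent_run = 0
        elif start is not None:
            silent_run += 1
            if silent_run >= min_silence_frames:
                # The pause is long enough: close the current segment.
                segments.append((start, voiced_end))
                start, silent_run = None, 0

    if start is not None:
        # Audio ended while a segment was still open.
        segments.append((start, voiced_end))
    return segments
```

Each returned pair can then be used to slice the waveform into clips that are passed on to ASR and forced alignment. A production pipeline would more likely rely on a dedicated voice activity detector.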

Quick Start & Requirements

  • Installation: Requires Python >= 3.8 and conda. Install dependencies with pip install -r requirements.txt (GPU) or pip install -r requirements_cpu.txt (CPU).
  • Prerequisites: ffmpeg (install via a package manager or download it); PyQt5 and sounddevice (for the GUI). A GPU with CUDA is recommended for training.
  • Setup: The README outlines the data preparation steps (audio extraction, noise reduction, segmentation, ASR, and MFA alignment) and provides training and inference scripts.
  • Links: Demo Video (Note: Link may be outdated).
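As an illustration of the audio extraction step that ffmpeg is needed for, the sketch below builds an ffmpeg command that drops the video stream and writes mono WAV audio. The file names, the helper names, and the 24 kHz sample rate are assumptions for the example; the project's own scripts may use different settings.

```python
import subprocess
from pathlib import Path
from typing import List


def build_extract_command(video: Path, wav: Path, sample_rate: int = 24000) -> List[str]:
    """Build an ffmpeg command that extracts mono audio from a video file.

    The 24000 Hz default is an assumed value, not one confirmed by the project.
    """
    return [
        "ffmpeg",
        "-y",                     # overwrite the output file if it exists
        "-i", str(video),         # input video
        "-vn",                    # discard the video stream
        "-ac", "1",               # downmix to a single (mono) channel
        "-ar", str(sample_rate),  # resample to the target rate
        str(wav),
    ]


def extract_audio(video: Path, wav: Path, sample_rate: int = 24000) -> None:
    """Run ffmpeg; raises CalledProcessError if ffmpeg exits with an error."""
    subprocess.run(build_extract_command(video, wav, sample_rate), check=True)


# Example call (hypothetical file names; requires ffmpeg on PATH):
# extract_audio(Path("stream.mp4"), Path("stream.wav"))
```

Keeping the command construction separate from the subprocess call makes it possible to inspect or log the exact command before running it.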

Highlighted Details

  • Supports both FastSpeech2 and SpeedySpeech acoustic models.
  • Includes a GUI for user-friendly operation (static model inference).
  • Detailed preprocessing pipeline for custom voice training.
  • Offers options for single-speaker and multi-speaker training.

Maintenance & Community

This appears to be a personal project whose TODO list points to planned further work. The README mentions no community channels (Discord, Slack) or notable contributors.

Licensing & Compatibility

The README does not explicitly state a license. The project depends on PaddleSpeech, which is released under the Apache 2.0 license. Compatibility with commercial use or closed-source linking is not specified.

Limitations & Caveats

Training requires significant GPU resources and time; CPU training is not feasible. The project lacks readily available pre-trained models or comprehensive tutorials, requiring users to follow the detailed, multi-step data preparation and training process. The ASR process is noted as slow due to batch size limitations.

Health Check

  • Last commit: 2 years ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 1 star in the last 90 days
