UI for text-based voice cloning using a 10-second audio sample
This project provides a user-friendly interface for XTTS-2, a text-to-speech model capable of voice cloning with as little as 10 seconds of audio. It targets users who need to generate synthetic speech in multiple languages with custom voices, offering a simplified workflow compared to direct model interaction.
How It Works
The UI leverages the XTTS-2 model, specifically `tts_models/multilingual/multi-dataset/xtts_v2`, to perform voice cloning. Users upload or record a short audio sample (around 10 seconds) of the target voice and provide text input; the system then synthesizes speech in the target voice, supporting 16 languages.
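As a rough sketch of how this maps onto the underlying library, the Coqui TTS Python API can drive the same checkpoint directly. The file paths and text below are illustrative placeholders, not part of this project:

```python
# Minimal voice-cloning sketch using the Coqui TTS API and the same
# XTTS-v2 checkpoint the UI wraps. Paths and text are placeholders.
import torch
from TTS.api import TTS

device = "cuda" if torch.cuda.is_available() else "cpu"

# Downloads the model on first use; Coqui's CPML terms must be accepted.
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device)

# Clone the voice from a ~10-second reference clip and synthesize the text.
tts.tts_to_file(
    text="Hello, this is a cloned voice.",
    speaker_wav="reference_voice.wav",  # short sample of the target voice
    language="en",                      # one of the 16 supported languages
    file_path="output.wav",
)
```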
Quick Start & Requirements
Install the dependencies with `pip install -r requirements.txt`, then upgrade the TTS package with `pip install --upgrade TTS`. PyTorch builds for CUDA 12.1 (`cu121`) and CUDA 11.8 (`cu118`) compatible GPUs are supported; users without a GPU should follow PyTorch's official installation instructions. Japanese support requires installing `fugashi` and potentially downloading the `unidic` dictionary.
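A quick sanity check, assuming the packages above installed cleanly, is to confirm the installed TTS version and whether PyTorch sees a CUDA device:

```python
# Environment sanity check; assumes torch and TTS are installed as above.
from importlib.metadata import version
import torch

print("TTS version:", version("TTS"))
print("CUDA available:", torch.cuda.is_available())
print("Torch CUDA build:", torch.version.cuda)  # e.g. "12.1" for cu121 wheels
```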
Highlighted Details
Maintenance & Community
The project is based on kanttouchthis/text_generation_webui_xtts. Further community and roadmap information is not explicitly detailed in the README.
Licensing & Compatibility
The README does not explicitly state the license for this UI project. However, it references the XTTS-v2 model, which is subject to Coqui's Commercial Product License Agreement (CPML), accessible at https://coqui.ai/cpml.txt. Users must agree to these terms.
Limitations & Caveats
The README notes that output quality is not "EL level" (presumably referring to ElevenLabs) and may not meet all expectations. Users must explicitly agree to the terms of service to use the XTTS model. The README also mentions a potential model re-downloading issue, referencing GitHub Issue 4723.