Open source implementation of Microsoft's VALL-E X zero-shot TTS model
This repository provides an open-source implementation of Microsoft's VALL-E X, a multilingual text-to-speech (TTS) and zero-shot voice cloning model. It enables users to synthesize speech in English, Chinese, and Japanese, clone voices from short audio samples, and control speech emotion and accent.
How It Works
VALL-E X generates audio GPT-style, autoregressively predicting audio tokens quantized by EnCodec. This approach allows for efficient, high-quality speech synthesis. Key advantages include its lightweight footprint, faster inference than similar models such as Bark, and stronger performance on Chinese and Japanese, along with cross-lingual synthesis and easy voice cloning.
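To make the token-prediction idea concrete, here is a toy, self-contained sketch (not the project's actual code) of GPT-style autoregressive sampling over a discrete codec-token vocabulary. The real model additionally conditions on text and an acoustic prompt, and a codec decoder turns the predicted tokens back into a waveform.

```python
# Toy sketch of GPT-style autoregressive prediction of discrete audio tokens.
# Vocabulary size mirrors an EnCodec-style codebook (1024 entries) only conceptually;
# this is NOT the repository's model, just an illustration of the decoding loop.
import torch
import torch.nn as nn

VOCAB = 1024 + 1      # 1024 codec codes + a BOS token
BOS = 1024

class TinyARModel(nn.Module):
    def __init__(self, dim=128, heads=4, layers=2):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, dim)
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, layers)
        self.head = nn.Linear(dim, VOCAB)

    def forward(self, tokens):
        # Causal mask so each position only attends to earlier tokens (GPT-style).
        n = tokens.size(1)
        mask = torch.triu(torch.full((n, n), float("-inf")), diagonal=1)
        h = self.encoder(self.embed(tokens), mask=mask)
        return self.head(h)

@torch.no_grad()
def generate_tokens(model, steps=75):
    """Sample `steps` audio tokens one at a time, conditioning on everything so far."""
    tokens = torch.tensor([[BOS]])
    for _ in range(steps):
        logits = model(tokens)[:, -1]                  # distribution over the next token
        next_tok = torch.multinomial(logits.softmax(dim=-1), num_samples=1)
        tokens = torch.cat([tokens, next_tok], dim=1)
    return tokens[:, 1:]                               # drop BOS; these index the codec codebook

model = TinyARModel()
codes = generate_tokens(model)
print(codes.shape)    # (1, 75): roughly one second of first-codebook tokens at EnCodec's 75 Hz
```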
Quick Start & Requirements
pip install -r requirements.txt
ffmpeg is required for prompt processing and must be in the system's PATH. The model checkpoint and Whisper's medium.pt are downloaded automatically on first run, or can be placed manually in ./checkpoints/ and ./whisper/ respectively.
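A minimal Python usage sketch follows; it assumes the repository exposes preload_models and generate_audio in utils.generation and writes the result to a WAV file via SciPy. Check the repo's own examples for the exact API.

```python
# Minimal usage sketch. Assumes utils.generation provides SAMPLE_RATE, preload_models()
# and generate_audio(); verify these names against the repository's documentation.
from utils.generation import SAMPLE_RATE, generate_audio, preload_models
from scipy.io.wavfile import write as write_wav

preload_models()                                    # downloads checkpoints on first run
audio_array = generate_audio("Hello, this is a test of VALL-E X.")
write_wav("vallex_output.wav", SAMPLE_RATE, audio_array)
```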
Highlighted Details
The web UI can be launched with python -X utf8 launch-ui.py.
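Zero-shot voice cloning from a short reference clip can also be driven from Python. The sketch below assumes a make_prompt helper in utils.prompt_making and a prompt argument on generate_audio; these names are assumptions and may differ from the actual API.

```python
# Hedged sketch of zero-shot voice cloning. Assumes utils.prompt_making.make_prompt()
# and a `prompt=` argument on generate_audio(); adjust to the actual API if it differs.
from utils.prompt_making import make_prompt
from utils.generation import SAMPLE_RATE, generate_audio, preload_models
from scipy.io.wavfile import write as write_wav

preload_models()
make_prompt(name="speaker1", audio_prompt_path="sample.wav")   # short clip of the target voice
audio_array = generate_audio("Cloned voice speaking a new sentence.", prompt="speaker1")
write_wav("cloned.wav", SAMPLE_RATE, audio_array)
```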
Maintenance & Community
The project is actively maintained, with recent updates including batch decoding in the AR decoder and replacing the EnCodec decoder with Vocos. Community support is available via Discord.
Licensing & Compatibility
Licensed under the MIT License, allowing for commercial use and integration with closed-source projects.
Limitations & Caveats
Generated audio is limited to approximately 22 seconds due to the Transformer's quadratic complexity. The project does not release training code, instead referencing an external implementation. Fine-tuning for better voice adaptation and .bat scripts are listed as future TODOs.