PyTorch code for efficient LLM-based text-to-speech inference
Spark-TTS is an LLM-based text-to-speech system designed for efficient, high-quality voice synthesis and zero-shot voice cloning. It targets researchers and developers needing natural, controllable speech generation across languages, offering a streamlined approach by directly generating audio from LLM-predicted speech tokens.
How It Works
Spark-TTS leverages the Qwen2.5 LLM to directly reconstruct audio from predicted speech tokens, eliminating the need for separate acoustic models. This single-stream, decoupled token approach simplifies the pipeline, enhances efficiency, and enables high-fidelity voice cloning and controllable speech synthesis (e.g., adjusting pitch, speaking rate) without explicit fine-tuning for each voice.
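The single-stream, decoupled-token flow described above can be sketched in miniature. Everything below is illustrative: the function names, token scheme, and decoder are hypothetical stand-ins, not Spark-TTS's actual API.

```python
# Toy sketch of a decoupled, single-stream TTS pipeline.
# All names and token mappings here are hypothetical; Spark-TTS's
# real implementation (Qwen2.5 LLM + token-to-audio decoder) differs.

def predict_speech_tokens(text: str, speaker_prompt: list[int]) -> list[int]:
    """Stand-in for the LLM stage: autoregressively predicts one stream
    of discrete speech tokens from text, conditioned on speaker/global
    tokens taken from a reference clip (zero-shot cloning)."""
    # Toy behavior: prefix the speaker prompt, then one token per character.
    return speaker_prompt + [ord(c) % 256 for c in text]

def decode_to_audio(tokens: list[int]) -> list[float]:
    """Stand-in for the decoder that reconstructs audio directly from
    predicted tokens, with no separate acoustic model in between."""
    return [t / 255.0 for t in tokens]

speaker_prompt = [7, 42]        # hypothetical global/speaker tokens
tokens = predict_speech_tokens("hi", speaker_prompt)
audio = decode_to_audio(tokens)
print(len(tokens), len(audio))  # → 4 4  (one token stream drives decoding)
```

The point of the sketch is the shape of the pipeline: a single token stream carries both speaker identity and content, so cloning a voice only changes the prompt tokens, not the model.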
Quick Start & Requirements
git clone https://github.com/SparkAudio/Spark-TTS.git
cd Spark-TTS
conda create -n sparktts -y python=3.12
conda activate sparktts
pip install -r requirements.txt

Download the pretrained model either with huggingface_hub's snapshot_download or by git clone with git-lfs.
Run inference with bash example/infer.sh, or directly via python -m cli.inference.
Launch the web UI with python webui.py --device 0.
Requires git-lfs for model download; a GPU is recommended for inference.
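The snapshot_download route can look like the sketch below. The repo id and local directory are assumptions; check the project's model card for the actual checkpoint name before using them.

```python
from huggingface_hub import snapshot_download

def fetch_checkpoint(
    repo_id: str = "SparkAudio/Spark-TTS-0.5B",       # assumed model id
    local_dir: str = "pretrained_models/Spark-TTS-0.5B",  # assumed layout
) -> str:
    """Download the Spark-TTS checkpoint from the Hugging Face Hub
    into local_dir and return the local path (network call)."""
    return snapshot_download(repo_id, local_dir=local_dir)
```

Calling fetch_checkpoint() performs the download; the git-lfs clone of the same Hub repository is an equivalent alternative.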
Limitations & Caveats
The project is primarily focused on inference and provides a usage disclaimer emphasizing responsible use. Training code and dataset (VoxBox) are listed as future releases (To-Do).