Spark-TTS by SparkAudio

PyTorch code for efficient LLM-based text-to-speech inference

Created 10 months ago

10,887 stars

Top 4.7% on SourcePulse

Project Summary

Spark-TTS is an LLM-based text-to-speech system designed for efficient, high-quality voice synthesis and zero-shot voice cloning. It targets researchers and developers needing natural, controllable speech generation across languages, offering a streamlined approach by directly generating audio from LLM-predicted speech tokens.

How It Works

Spark-TTS is built on the Qwen2.5 LLM and reconstructs audio directly from the speech tokens the LLM predicts, so it needs no separate acoustic model. This single-stream, decoupled-token design simplifies the pipeline and improves efficiency. It also enables high-fidelity voice cloning and controllable speech synthesis (e.g., adjusting pitch or speaking rate) without fine-tuning for each voice.
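As a rough illustration of the token flow described above, the toy sketch below keeps speaker-level tokens and content tokens in separate fields and decodes them straight to samples with no intermediate acoustic model. Everything here is hypothetical: the class, the function names, and the arithmetic are stand-ins chosen for illustration, and none of it is Spark-TTS code or its real tokenizer.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SpeechTokens:
    """Hypothetical container for two decoupled token types."""
    global_tokens: List[int]    # fixed length: speaker-level attributes such as timbre
    semantic_tokens: List[int]  # variable length: what is being said


def toy_llm_predict(text: str, speaker_tokens: List[int]) -> SpeechTokens:
    """Stand-in for the LLM: emits one fake semantic token per input character."""
    semantic = [ord(ch) % 256 for ch in text]
    return SpeechTokens(global_tokens=list(speaker_tokens), semantic_tokens=semantic)


def toy_codec_decode(tokens: SpeechTokens, samples_per_token: int = 4) -> List[float]:
    """Stand-in for the codec decoder: tokens map directly to waveform samples."""
    # Speaker tokens only set a constant gain here; a real decoder conditions on them.
    gain = 1.0 + sum(tokens.global_tokens) / (256.0 * max(len(tokens.global_tokens), 1))
    waveform: List[float] = []
    for tok in tokens.semantic_tokens:
        sample = gain * (tok / 255.0 - 0.5)
        waveform.extend([sample] * samples_per_token)
    return waveform


if __name__ == "__main__":
    voice_a = [10, 20, 30]  # hypothetical speaker tokens from a reference clip
    tokens = toy_llm_predict("hi", voice_a)
    audio = toy_codec_decode(tokens)
    print(len(tokens.semantic_tokens), len(audio))
```

Swapping the speaker tokens changes the decoded samples while the content tokens stay the same, which is the intuition behind cloning a voice without retraining.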

Quick Start & Requirements

Install: Clone the repository, create a Conda environment, and install dependencies with pip:

git clone https://github.com/SparkAudio/Spark-TTS.git
cd Spark-TTS
conda create -n sparktts -y python=3.12
conda activate sparktts
pip install -r requirements.txt

Model Download: Use Hugging Face snapshot_download or git clone with git-lfs.
Basic Usage: Run inference via bash example/infer.sh or python -m cli.inference.
Web UI: Start with python webui.py --device 0.
Prerequisites: Python 3.12, Conda, git-lfs. GPU recommended for inference.
Docs: See the Runtime section of the repository for Triton/TensorRT-LLM deployment.
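The download and inference steps can be sketched in Python as follows. Treat this as a sketch under assumptions rather than the project's documented interface: the repo id SparkAudio/Spark-TTS-0.5B, the pretrained_models directory, and the CLI flags (--text, --device, --save_dir, --model_dir, --prompt_text, --prompt_speech_path) are not stated in this summary, so check them against the README or python -m cli.inference --help before relying on them. The helper names are hypothetical.

```python
import shlex
from pathlib import Path
from typing import List, Optional

# Assumptions (not stated in this summary): repo id and local model directory.
ASSUMED_REPO_ID = "SparkAudio/Spark-TTS-0.5B"
ASSUMED_MODEL_DIR = Path("pretrained_models") / "Spark-TTS-0.5B"


def download_model(repo_id: str = ASSUMED_REPO_ID,
                   local_dir: Path = ASSUMED_MODEL_DIR) -> str:
    """Fetch the model files with Hugging Face snapshot_download."""
    # Imported lazily so the rest of this sketch runs without huggingface_hub.
    from huggingface_hub import snapshot_download
    return snapshot_download(repo_id=repo_id, local_dir=str(local_dir))


def build_inference_command(text: str,
                            save_dir: str,
                            model_dir: Path = ASSUMED_MODEL_DIR,
                            device: int = 0,
                            prompt_text: Optional[str] = None,
                            prompt_speech_path: Optional[str] = None) -> List[str]:
    """Build an argv list for `python -m cli.inference` (flag names are assumed)."""
    cmd = [
        "python", "-m", "cli.inference",
        "--text", text,
        "--device", str(device),
        "--save_dir", save_dir,
        "--model_dir", Path(model_dir).as_posix(),
    ]
    # The reference clip and its transcript only matter for voice cloning.
    if prompt_text is not None:
        cmd += ["--prompt_text", prompt_text]
    if prompt_speech_path is not None:
        cmd += ["--prompt_speech_path", prompt_speech_path]
    return cmd


if __name__ == "__main__":
    # Print the command instead of running it, so nothing is downloaded or executed.
    print(shlex.join(build_inference_command("Hello from Spark-TTS.", "results")))
```

To run inference for real, call download_model() once and pass the list to subprocess.run(cmd, check=True) from the repository root, so that the cli package resolves.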

Highlighted Details

Supports zero-shot voice cloning for cross-lingual and code-switching scenarios.
Enables controllable speech generation by adjusting parameters like gender, pitch, and speaking rate.
Offers Nvidia Triton Inference Serving support, with benchmark results showing low latency and a low real-time factor (e.g., an RTF of 0.1362 on an L20 GPU; lower is faster).
Bilingual support for Chinese and English.
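To put the RTF figure above in context: real-time factor is conventionally defined as processing time divided by the duration of the audio produced, so values below 1 mean faster than real time. The short sketch below works through that arithmetic for the quoted value. The function names are hypothetical, and the 10-second clip is an illustrative input, not a benchmark result.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent synthesizing / duration of the synthesized audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def processing_time_for(audio_seconds: float, rtf: float) -> float:
    """Estimated synthesis time for a clip of the given length at a given RTF."""
    return audio_seconds * rtf


if __name__ == "__main__":
    quoted_rtf = 0.1362  # figure quoted for the L20 GPU
    clip_seconds = 10.0  # illustrative clip length
    print(f"{processing_time_for(clip_seconds, quoted_rtf):.3f} s to synthesize")
    print(f"{1.0 / quoted_rtf:.1f}x faster than real time")
```

At that RTF, a 10-second clip would take roughly 1.4 seconds to generate, about 7.3 times faster than playback.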

Maintenance & Community

Paper published on arXiv: Spark-TTS.
Nvidia Triton Inference Serving support added March 12, 2025.
Community contributions noted for Windows installation guide.

Licensing & Compatibility

License: Not explicitly stated in the README.
Compatibility: Intended for academic research, educational purposes, and legitimate applications. Usage disclaimer strongly advises against unauthorized voice cloning, impersonation, fraud, scams, deepfakes, or illegal activities.

Limitations & Caveats

The project is primarily focused on inference and provides a usage disclaimer emphasizing responsible use. Training code and dataset (VoxBox) are listed as future releases (To-Do).

Explore Similar Projects

Meta-voicebox by SpeechifyInc

Cross-Lingual-Voice-Cloning by deterministic-algorithms-lab

FireRedTTS by FireRedTeam

sesame_csm_openai by phildougherty

vits-simple-api by Artrajz

MARS5-TTS by Camb-ai

WhisperSpeech by WhisperSpeech

metavoice-src by metavoiceio

Zonos by Zyphra

CosyVoice by FunAudioLLM

OpenVoice by myshell-ai

GPT-SoVITS by RVC-Boss