Spark-TTS by SparkAudio

PyTorch code for efficient LLM-based text-to-speech inference

created 5 months ago
10,172 stars

Top 5.0% on sourcepulse

Project Summary

Spark-TTS is an LLM-based text-to-speech system designed for efficient, high-quality voice synthesis and zero-shot voice cloning. It targets researchers and developers needing natural, controllable speech generation across languages, offering a streamlined approach by directly generating audio from LLM-predicted speech tokens.

How It Works

Spark-TTS is built on the Qwen2.5 LLM. The LLM predicts speech tokens, and audio is reconstructed directly from those tokens, eliminating the need for a separate acoustic model. This single-stream approach with decoupled speech tokens simplifies the pipeline, improves efficiency, and enables high-fidelity zero-shot voice cloning and controllable synthesis (e.g., adjusting pitch and speaking rate) without fine-tuning for each voice.

Quick Start & Requirements

  • Install: Clone the repository and install dependencies via Conda:
    git clone https://github.com/SparkAudio/Spark-TTS.git
    cd Spark-TTS
    conda create -n sparktts -y python=3.12
    conda activate sparktts
    pip install -r requirements.txt
    
  • Model Download: Use Hugging Face snapshot_download or git clone with git-lfs.
  • Basic Usage: Run inference via bash example/infer.sh or python -m cli.inference.
  • Web UI: Start with python webui.py --device 0.
  • Prerequisites: Python 3.12, Conda, git-lfs. GPU recommended for inference.
  • Docs: See the Runtime section of the README for Triton/TensorRT-LLM deployment.
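
The model download step mentioned above can be sketched in Python as follows. This is a minimal sketch, not the project's own script: the repository ID "SparkAudio/Spark-TTS-0.5B", the "pretrained_models" directory, and the helper names target_dir and download_model are assumptions made for illustration, so check the project README for the actual model name and expected location. Only snapshot_download from the huggingface_hub package is a real library call.

```python
from pathlib import Path

# Assumption: the Hugging Face repo ID of the model; verify it in the README.
REPO_ID = "SparkAudio/Spark-TTS-0.5B"


def target_dir(repo_id: str, root: str = "pretrained_models") -> Path:
    """Return the local directory for a repo: <root>/<repo name>."""
    return Path(root) / repo_id.split("/")[-1]


def download_model(repo_id: str = REPO_ID, root: str = "pretrained_models") -> Path:
    """Download every file of the repo into target_dir(repo_id, root)."""
    # Imported lazily so target_dir() works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    local_dir = target_dir(repo_id, root)
    snapshot_download(repo_id=repo_id, local_dir=str(local_dir))
    return local_dir
```

Calling download_model() from the repository root would fetch the files over the network (requires pip install huggingface_hub) and return the local path, which can then be passed to the inference script as the model directory.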

Highlighted Details

  • Supports zero-shot voice cloning for cross-lingual and code-switching scenarios.
  • Enables controllable speech generation by adjusting parameters like gender, pitch, and speaking rate.
  • Offers Nvidia Triton Inference Serving support, with benchmark results showing low latency and a low real-time factor (e.g., 0.1362 RTF on an L20 GPU; lower is faster).
  • Bilingual support for Chinese and English.
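
To make the RTF figure above concrete: the real-time factor is conventionally defined as processing time divided by the duration of the generated audio, so values below 1 mean faster-than-real-time synthesis. The sketch below uses that common definition; the function names are illustrative, and the exact measurement setup behind the project's benchmark is not described here.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent synthesizing / duration of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def speedup_over_real_time(rtf: float) -> float:
    """How many seconds of audio are produced per second of compute."""
    if rtf <= 0:
        raise ValueError("rtf must be positive")
    return 1.0 / rtf


# Example: 10 s of audio synthesized in 1.362 s gives an RTF of about 0.1362,
# which corresponds to roughly 7.3 s of audio per second of compute.
rtf = real_time_factor(1.362, 10.0)
print(f"RTF = {rtf:.4f}, speedup = {speedup_over_real_time(rtf):.2f}x")
```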

Maintenance & Community

  • Paper published on arXiv: Spark-TTS.
  • Nvidia Triton Inference Serving support added March 12, 2025.
  • A community-contributed Windows installation guide is available.

Licensing & Compatibility

  • License: Not explicitly stated in the README.
  • Compatibility: Intended for academic research, educational purposes, and legitimate applications. Usage disclaimer strongly advises against unauthorized voice cloning, impersonation, fraud, scams, deepfakes, or illegal activities.

Limitations & Caveats

The project is primarily focused on inference and provides a usage disclaimer emphasizing responsible use. Training code and dataset (VoxBox) are listed as future releases (To-Do).

Health Check

  • Last commit: 3 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 3
  • Star History: 1,204 stars in the last 90 days
