Spark-TTS by SparkAudio

PyTorch code for efficient LLM-based text-to-speech inference

Created 6 months ago
10,499 stars

Top 4.8% on SourcePulse

Project Summary

Spark-TTS is an LLM-based text-to-speech system designed for efficient, high-quality voice synthesis and zero-shot voice cloning. It targets researchers and developers needing natural, controllable speech generation across languages, offering a streamlined approach by directly generating audio from LLM-predicted speech tokens.

How It Works

Spark-TTS is built on the Qwen2.5 LLM: the model predicts speech tokens, and audio is reconstructed directly from those tokens, with no separate acoustic model in between. This single-stream, decoupled-token design simplifies the pipeline, improves efficiency, and enables high-fidelity voice cloning and controllable synthesis (e.g., adjusting pitch or speaking rate) without fine-tuning for each voice.
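
The flow described above can be pictured with a toy sketch. Everything here is hypothetical: the class and function names, the split into global (speaker) and semantic (content) tokens, and the 320-samples-per-token figure are illustrative assumptions, and the stubs stand in for the real LLM and decoder rather than reproducing them.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SpeechTokens:
    """Hypothetical container for the two decoupled token types."""
    global_tokens: List[int]    # speaker-level attributes, taken from a reference voice
    semantic_tokens: List[int]  # content-level tokens, one sequence per utterance


def toy_llm_predict(text: str, reference: SpeechTokens) -> SpeechTokens:
    """Stand-in for the LLM step: keep the reference speaker's global tokens
    and emit one fake semantic token per input character."""
    fake_semantic = [ord(ch) % 256 for ch in text]
    return SpeechTokens(list(reference.global_tokens), fake_semantic)


def toy_decode(tokens: SpeechTokens, samples_per_token: int = 320) -> List[float]:
    """Stand-in for the decoder step: return silence whose length is
    proportional to the number of semantic tokens (no acoustic model involved)."""
    return [0.0] * (len(tokens.semantic_tokens) * samples_per_token)


# Zero-shot cloning in this toy: a new sentence reuses the reference speaker tokens.
reference_voice = SpeechTokens(global_tokens=[7, 7, 7], semantic_tokens=[])
predicted = toy_llm_predict("hi", reference_voice)
waveform = toy_decode(predicted)
print(len(predicted.semantic_tokens), len(waveform))
```

The point of the sketch is only the shape of the data flow: text and a reference voice go in, one token stream comes out, and the waveform is produced from that stream in a single decoding step.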

Quick Start & Requirements

  • Install: Clone the repository, create a Conda environment, and install dependencies with pip:
    git clone https://github.com/SparkAudio/Spark-TTS.git
    cd Spark-TTS
    conda create -n sparktts -y python=3.12
    conda activate sparktts
    pip install -r requirements.txt
    
  • Model Download: Use Hugging Face snapshot_download or git clone with git-lfs.
  • Basic Usage: Run inference via bash example/infer.sh or python -m cli.inference.
  • Web UI: Start with python webui.py --device 0.
  • Prerequisites: Python 3.12, Conda, git-lfs. GPU recommended for inference.
  • Docs: See the repository's Runtime section for Triton/TensorRT-LLM deployment.
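
The download and inference steps above can also be scripted from Python. The following is a minimal sketch, not the project's own tooling: the checkpoint name SparkAudio/Spark-TTS-0.5B, the pretrained_models directory, and the CLI flags (--text, --save_dir, --model_dir, --prompt_text, --prompt_speech_path, and --device for this entry point) are assumptions recalled from the project's README, and the helper names are hypothetical. Confirm the flags with python -m cli.inference --help before relying on them.

```python
import shlex
import sys
from pathlib import Path
from typing import List, Optional

REPO_ID = "SparkAudio/Spark-TTS-0.5B"  # assumed Hugging Face checkpoint id


def local_model_dir(root: str = "pretrained_models", repo_id: str = REPO_ID) -> Path:
    """Directory the checkpoint is expected to live in (assumed layout)."""
    return Path(root) / repo_id.split("/")[-1]


def download_model(root: str = "pretrained_models", repo_id: str = REPO_ID) -> Path:
    """Fetch the checkpoint with huggingface_hub.snapshot_download."""
    # Imported lazily so the rest of this sketch works without huggingface_hub.
    from huggingface_hub import snapshot_download

    target = local_model_dir(root, repo_id)
    snapshot_download(repo_id=repo_id, local_dir=str(target))
    return target


def build_inference_cmd(
    text: str,
    model_dir: Path,
    save_dir: str,
    prompt_speech_path: Optional[str] = None,
    prompt_text: Optional[str] = None,
    device: int = 0,
) -> List[str]:
    """Build (but do not run) a `python -m cli.inference` command line."""
    cmd = [
        sys.executable, "-m", "cli.inference",
        "--text", text,
        "--device", str(device),
        "--save_dir", str(save_dir),
        "--model_dir", str(model_dir),
    ]
    # Reference audio and its transcript are only needed for voice cloning.
    if prompt_speech_path is not None:
        cmd += ["--prompt_speech_path", str(prompt_speech_path)]
    if prompt_text is not None:
        cmd += ["--prompt_text", prompt_text]
    return cmd


cmd = build_inference_cmd("Hello from Spark-TTS.", local_model_dir(), "example/results")
print(shlex.join(cmd))
```

To actually synthesize speech, call download_model() once and then pass the command to subprocess.run(cmd, check=True) from the repository root. Building the command as a list avoids shell-quoting problems with free-form input text.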

Highlighted Details

  • Supports zero-shot voice cloning for cross-lingual and code-switching scenarios.
  • Enables controllable speech generation by adjusting parameters like gender, pitch, and speaking rate.
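  • Offers Nvidia Triton Inference Serving support, with benchmark results showing low latency and a low real-time factor (RTF), e.g., 0.1362 on an L20 GPU. Lower RTF is better: values below 1 mean audio is generated faster than it plays back.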
  • Bilingual support for Chinese and English.
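
To put the RTF figure quoted above in concrete terms, here is a small worked example. It assumes the common definition of real-time factor (processing time divided by the duration of the generated audio); the function name and the 10-second clip length are illustrative, not taken from the project's benchmark.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent generating / duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


reported_rtf = 0.1362          # figure quoted for the L20 GPU
clip_seconds = 10.0            # illustrative clip length (assumption)

# Time needed to synthesize the clip at the reported RTF.
compute_seconds = reported_rtf * clip_seconds
# How many times faster than playback the synthesis runs.
speedup = 1.0 / reported_rtf

print(f"{clip_seconds:.0f} s of audio takes about {compute_seconds:.2f} s to generate")
print(f"That is roughly {speedup:.1f}x faster than real time")
```

At this RTF, generation runs several times faster than playback, which is why a low value (not a high one) is the desirable outcome.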

Maintenance & Community

  • Paper published on arXiv: Spark-TTS.
  • Nvidia Triton Inference Serving support added March 12, 2025.
  • A community-contributed Windows installation guide is acknowledged.

Licensing & Compatibility

  • License: Not explicitly stated in the README.
  • Compatibility: Intended for academic research, educational purposes, and legitimate applications. Usage disclaimer strongly advises against unauthorized voice cloning, impersonation, fraud, scams, deepfakes, or illegal activities.

Limitations & Caveats

The project is primarily focused on inference and provides a usage disclaimer emphasizing responsible use. Training code and dataset (VoxBox) are listed as future releases (To-Do).

Health Check

  • Last Commit: 5 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 4
  • Star History: 155 stars in the last 30 days
