FlashTTS by HuiResearch

TTS tool for high-quality Chinese speech synthesis and voice cloning

created 4 months ago
497 stars

Top 63.3% on sourcepulse

View on GitHub
Project Summary

FlashTTS provides high-quality Chinese text-to-speech (TTS) and zero-shot voice cloning, leveraging advanced models like SparkTTS, OrpheusTTS, and MegaTTS 3. It targets developers and users needing natural-sounding speech for applications such as dubbing, reading, accessibility, and virtual characters, offering a user-friendly web interface for quick generation.

How It Works

FlashTTS utilizes a modular architecture, supporting multiple high-performance inference backends including vllm, sglang, llama-cpp, mlx-lm, and tensorrt-llm. This flexibility allows users to choose the most efficient engine for their hardware and performance needs. It features dynamic batching and asynchronous queues for high concurrency, enabling it to handle significant request loads. The system offers fine-grained control over speech parameters like pitch, speed, and emotion, and supports streaming TTS for improved interactivity.
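For concreteness, here is a minimal client sketch against a locally running flashtts serve instance. The host, port, route, and payload fields are assumptions modeled on OpenAI-style speech APIs rather than confirmed FlashTTS routes; consult server.md for the actual schema.

# Hypothetical request to a local FlashTTS server started with `flashtts serve`.
# Route, port, and JSON fields are assumptions; see server.md for the real API.
import requests

resp = requests.post(
    "http://127.0.0.1:8000/v1/audio/speech",   # assumed host/port/route
    json={
        "input": "你好，欢迎使用 FlashTTS。",    # text to synthesize
        "voice": "female",                       # assumed speaker/voice field
        "speed": 1.0,                            # one of the fine-grained control knobs
    },
    timeout=120,
)
resp.raise_for_status()
with open("output.wav", "wb") as f:
    f.write(resp.content)                        # server is assumed to return WAV bytes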

Quick Start & Requirements

  • Install via pip: pip install flashtts
  • Recommended Python version: 3.8 - 3.12
  • GPU acceleration is highly recommended for optimal performance, with specific backends like vllm and sglang demonstrating significant speedups.
  • Local inference command: flashtts infer -i "text" -o output.wav -m ./models/your_model -b vllm (a batching sketch based on this command follows the list)
  • Deployment command: flashtts serve --model_path Spark-TTS-0.5B --backend vllm --llm_device cuda
  • Documentation: 📘 Documentation (linked from the README)
  • Quick start guide: installation.md, quick_start.md
  • Deployment guide: server.md
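Building on the local inference command above, the following sketch wraps the documented flashtts infer CLI to synthesize several lines of text in a loop. The model path and backend flag are the same placeholders as in that command and should be adjusted to your setup.

# Batch several texts through the documented `flashtts infer` CLI.
# MODEL_DIR and BACKEND mirror the placeholders shown in the command above.
import subprocess
from pathlib import Path

MODEL_DIR = "./models/your_model"   # placeholder model directory
BACKEND = "vllm"                    # or sglang / llama-cpp / mlx-lm / tensorrt-llm
OUT_DIR = Path("outputs")
OUT_DIR.mkdir(exist_ok=True)

texts = [
    "今天天气不错。",
    "FlashTTS 支持多种推理后端。",
]

for i, text in enumerate(texts):
    out_path = OUT_DIR / f"line_{i:03d}.wav"
    subprocess.run(
        ["flashtts", "infer", "-i", text, "-o", str(out_path), "-m", MODEL_DIR, "-b", BACKEND],
        check=True,  # raise if the CLI exits with a non-zero status
    )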

Highlighted Details

  • Supports multiple inference backends (vllm, sglang, llama-cpp, etc.) for accelerated inference.
  • Achieves a low Real-Time Factor (RTF), i.e. synthesis time divided by audio duration: 0.04 for long text with the sglang backend on an A800 GPU (a helper for computing RTF is sketched after this list).
  • Offers fine-grained control over speech parameters (pitch, speed, temperature, emotion tags).
  • Features long text synthesis with consistent voice timbre and streaming TTS for reduced latency.
  • Supports multi-character dialogue synthesis within the same text.
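The RTF figure cited above is synthesis wall-clock time divided by the duration of the generated audio, so values below 1.0 mean faster-than-real-time synthesis. The helper below only illustrates how the metric is computed; it is not part of the FlashTTS API, and the synthesize callable is a placeholder for whatever generation function you use.

# Compute Real-Time Factor (RTF) for any function that writes a WAV file.
import time
import wave

def measure_rtf(synthesize, text: str, wav_path: str) -> float:
    """Run `synthesize(text, wav_path)` and return synthesis_time / audio_duration."""
    start = time.perf_counter()
    synthesize(text, wav_path)
    elapsed = time.perf_counter() - start

    with wave.open(wav_path, "rb") as wav:
        audio_seconds = wav.getnframes() / wav.getframerate()

    return elapsed / audio_seconds   # < 1.0 means faster than real time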

Maintenance & Community

The project is associated with HuiResearch. Further community engagement details such as Discord/Slack links or a roadmap are not explicitly provided in the README.

Licensing & Compatibility

The project inherits the license from Spark-TTS. The specific license details are available in the LICENSE file. It is intended for academic research, education, and legitimate uses like accessibility, but explicitly prohibits fraudulent or illegal applications such as deepfakes.

Limitations & Caveats

  • MegaTTS 3's WaveVAE encoder is not publicly released due to security considerations; users must follow the official instructions to obtain it.
  • SparkTTS weights require bfloat16 or float32 precision; float16 will cause errors (see the dtype sketch below).
  • For extended silence in generated audio, increasing repetition_penalty is suggested.
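To illustrate the precision caveat, the generic PyTorch snippet below picks a safe dtype for the current GPU. It is not FlashTTS's own loading code, only a sketch of the bfloat16/float32 constraint.

# Pick a dtype that satisfies the SparkTTS constraint: bfloat16 or float32, never float16.
import torch

def pick_safe_dtype() -> torch.dtype:
    # Prefer bfloat16 on GPUs that support it, otherwise fall back to float32.
    if torch.cuda.is_available() and torch.cuda.is_bf16_supported():
        return torch.bfloat16
    return torch.float32

print(f"Load SparkTTS weights with dtype: {pick_safe_dtype()}")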

Health Check

  • Last commit: 2 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 2
  • Star history: 154 stars in the last 90 days

Explore Similar Projects

Starred by Georgios Konstantopoulos (CTO, General Partner at Paradigm) and Chip Huyen (author of AI Engineering and Designing Machine Learning Systems).

GPT-SoVITS by RVC-Boss

  • Few-shot voice cloning and TTS web UI
  • Top 0.5% on sourcepulse
  • 49k stars
  • Created 1 year ago, updated 1 day ago