FlashTTS by HuiResearch

TTS tool for high-quality Chinese speech synthesis and voice cloning

Created 6 months ago
523 stars

Top 60.3% on SourcePulse

Project Summary

FlashTTS provides high-quality Chinese text-to-speech (TTS) and zero-shot voice cloning, built on models such as SparkTTS, OrpheusTTS, and MegaTTS 3. It targets developers and users who need natural-sounding speech for dubbing, reading aloud, accessibility, and virtual characters, and it includes a web interface for quick generation.

How It Works

FlashTTS has a modular architecture that supports multiple inference backends: vllm, sglang, llama-cpp, mlx-lm, and tensorrt-llm. Users can pick the engine that best fits their hardware and performance needs. Dynamic batching and asynchronous queues let the server handle many concurrent requests. Speech parameters such as pitch, speed, and emotion can be controlled individually, and streaming TTS is supported to improve interactivity.
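To make the dynamic batching idea concrete, here is a minimal sketch of how an asynchronous queue can group concurrent requests into batches. It is not FlashTTS's implementation: the class DynamicBatcher, its parameters max_batch_size and max_wait_s, and the stand-in fake_synthesize function are all hypothetical names chosen for illustration.

```python
import asyncio
from typing import Callable, List, Tuple


class DynamicBatcher:
    """Hypothetical helper: groups concurrent requests into model batches."""

    def __init__(
        self,
        run_batch: Callable[[List[str]], List[str]],
        max_batch_size: int = 8,
        max_wait_s: float = 0.01,
    ) -> None:
        self._run_batch = run_batch
        self._max_batch_size = max_batch_size
        self._max_wait_s = max_wait_s
        self._queue: asyncio.Queue = asyncio.Queue()

    async def submit(self, text: str) -> str:
        """Enqueue one request and wait until its batch has been processed."""
        future = asyncio.get_running_loop().create_future()
        await self._queue.put((text, future))
        return await future

    async def worker(self) -> None:
        """Run forever: collect a batch, call the model once, deliver results."""
        loop = asyncio.get_running_loop()
        while True:
            # Block until at least one request is available.
            items = [await self._queue.get()]
            deadline = loop.time() + self._max_wait_s
            # Keep collecting until the batch is full or the wait budget is spent.
            while len(items) < self._max_batch_size:
                remaining = deadline - loop.time()
                if remaining <= 0:
                    break
                try:
                    items.append(
                        await asyncio.wait_for(self._queue.get(), remaining)
                    )
                except asyncio.TimeoutError:
                    break
            results = self._run_batch([text for text, _ in items])
            for (_, future), result in zip(items, results):
                if not future.done():
                    future.set_result(result)


async def demo() -> Tuple[List[str], List[int]]:
    batch_sizes: List[int] = []

    def fake_synthesize(texts: List[str]) -> List[str]:
        # Stand-in for a real batched model call; records each batch size.
        batch_sizes.append(len(texts))
        return [f"audio<{t}>" for t in texts]

    batcher = DynamicBatcher(fake_synthesize, max_batch_size=4, max_wait_s=0.05)
    worker = asyncio.create_task(batcher.worker())
    outputs = await asyncio.gather(
        *(batcher.submit(f"line {i}") for i in range(6))
    )
    worker.cancel()
    return list(outputs), batch_sizes


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The trade-off in a design like this is between latency and throughput: a larger max_wait_s fills batches more often but delays the first request in each batch.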

Quick Start & Requirements

  • Install via pip: pip install flashtts
  • Recommended Python version: 3.8 - 3.12
  • GPU acceleration is highly recommended for optimal performance, with specific backends like vllm and sglang demonstrating significant speedups.
  • Local inference command: flashtts infer -i "text" -o output.wav -m ./models/your_model -b vllm
  • Deployment command: flashtts serve --model_path Spark-TTS-0.5B --backend vllm --llm_device cuda
  • Documentation: 📘 Documentation
  • Quick start guide: installation.md, quick_start.md
  • Deployment guide: server.md
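For scripting, the two commands above can be assembled as argument lists, which avoids shell-quoting problems when the input text contains spaces or punctuation. The sketch below uses only the subcommands and flags that appear in the commands listed above; the function names build_infer_command and build_serve_command are hypothetical, and no other flags or defaults are assumed to exist.

```python
import shlex
from typing import List


def build_infer_command(
    text: str, output_path: str, model_dir: str, backend: str = "vllm"
) -> List[str]:
    """Argument list mirroring the local inference command shown above."""
    return [
        "flashtts", "infer",
        "-i", text,
        "-o", output_path,
        "-m", model_dir,
        "-b", backend,
    ]


def build_serve_command(
    model_path: str, backend: str = "vllm", llm_device: str = "cuda"
) -> List[str]:
    """Argument list mirroring the deployment command shown above."""
    return [
        "flashtts", "serve",
        "--model_path", model_path,
        "--backend", backend,
        "--llm_device", llm_device,
    ]


if __name__ == "__main__":
    infer_cmd = build_infer_command("text", "output.wav", "./models/your_model")
    serve_cmd = build_serve_command("Spark-TTS-0.5B")
    # shlex.join produces a copy-pasteable, correctly quoted shell line.
    print(shlex.join(infer_cmd))
    print(shlex.join(serve_cmd))
```

Either list can be passed directly to subprocess.run once flashtts is installed; note that the serve command starts a long-running process.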

Highlighted Details

  • Supports multiple inference backends (vllm, sglang, llama-cpp, etc.) for accelerated inference.
  • Achieves low Real-Time Factor (RTF) with backends like sglang (0.04 RTF for long text on A800 GPU).
  • Offers fine-grained control over speech parameters (pitch, speed, temperature, emotion tags).
  • Features long text synthesis with consistent voice timbre and streaming TTS for reduced latency.
  • Supports multi-character dialogue synthesis within the same text.
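As a reading aid for the RTF figure above: Real-Time Factor is conventionally defined as synthesis time divided by the duration of the audio produced, so values below 1 mean faster than real time. The small sketch below works through that arithmetic; the function names are hypothetical and the 60-second clip is an illustrative input, not a benchmark from the project.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent synthesizing / duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def estimated_synthesis_seconds(audio_seconds: float, rtf: float) -> float:
    """Invert the definition: how long synthesis takes at a given RTF."""
    return audio_seconds * rtf


if __name__ == "__main__":
    # At the quoted RTF of 0.04, one minute of audio would take about 2.4 s.
    print(round(estimated_synthesis_seconds(60.0, 0.04), 2))
```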

Maintenance & Community

The project is associated with HuiResearch. Further community engagement details such as Discord/Slack links or a roadmap are not explicitly provided in the README.

Licensing & Compatibility

The project inherits the license from Spark-TTS. The specific license details are available in the LICENSE file. It is intended for academic research, education, and legitimate uses like accessibility, but explicitly prohibits fraudulent or illegal applications such as deepfakes.

Limitations & Caveats

The WaveVAE encoder for MegaTTS 3 has not been publicly released due to security considerations; users must follow the official instructions for download. SparkTTS weights must be loaded in bfloat16 or float32 precision; float16 causes errors. If the output contains extended silence, increasing repetition_penalty is the suggested fix.
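The precision caveat is easy to trip over in launch scripts, so a guard that rejects float16 before any model is loaded can save a failed start. The following is a minimal sketch of such a check; check_spark_tts_dtype and ALLOWED_SPARK_TTS_DTYPES are hypothetical names and are not part of FlashTTS.

```python
# Precisions the text above says SparkTTS weights accept (hypothetical constant).
ALLOWED_SPARK_TTS_DTYPES = ("bfloat16", "float32")


def check_spark_tts_dtype(dtype: str) -> str:
    """Return the normalized dtype name, or raise if it is unsupported."""
    normalized = dtype.strip().lower()
    if normalized not in ALLOWED_SPARK_TTS_DTYPES:
        raise ValueError(
            f"SparkTTS weights need one of {ALLOWED_SPARK_TTS_DTYPES}, "
            f"got {dtype!r}"
        )
    return normalized


if __name__ == "__main__":
    print(check_spark_tts_dtype("bfloat16"))
```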

Health Check

  • Last commit: 4 months ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 4

Star History

  • 13 stars in the last 30 days
