FlashTTS by HuiResearch

TTS tool for high-quality Chinese speech synthesis and voice cloning

created 4 months ago
497 stars

Top 63.3% on sourcepulse

View on GitHub
Project Summary

FlashTTS provides high-quality Chinese text-to-speech (TTS) and zero-shot voice cloning, leveraging advanced models like SparkTTS, OrpheusTTS, and MegaTTS 3. It targets developers and users needing natural-sounding speech for applications such as dubbing, reading, accessibility, and virtual characters, offering a user-friendly web interface for quick generation.

How It Works

FlashTTS utilizes a modular architecture, supporting multiple high-performance inference backends including vllm, sglang, llama-cpp, mlx-lm, and tensorrt-llm. This flexibility allows users to choose the most efficient engine for their hardware and performance needs. It features dynamic batching and asynchronous queues for high concurrency, enabling it to handle significant request loads. The system offers fine-grained control over speech parameters like pitch, speed, and emotion, and supports streaming TTS for improved interactivity.
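For concreteness, here is a minimal client sketch against a locally running flashtts serve instance. The host, port, route, and payload fields are assumptions modeled on OpenAI-style speech APIs rather than confirmed FlashTTS routes; consult server.md for the actual schema.

# Hypothetical request to a local FlashTTS server started with `flashtts serve`.
# Route, port, and JSON fields are assumptions; see server.md for the real API.
import requests

resp = requests.post(
    "http://127.0.0.1:8000/v1/audio/speech",   # assumed host/port/route
    json={
        "input": "你好，欢迎使用 FlashTTS。",    # text to synthesize
        "voice": "female",                       # assumed speaker/voice field
        "speed": 1.0,                            # one of the fine-grained control knobs
    },
    timeout=120,
)
resp.raise_for_status()
with open("output.wav", "wb") as f:
    f.write(resp.content)                        # server is assumed to return WAV bytes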

Quick Start & Requirements

  • Install via pip: pip install flashtts
  • Recommended Python version: 3.8 - 3.12
  • GPU acceleration is highly recommended for optimal performance, with specific backends like vllm and sglang demonstrating significant speedups.
  • Local inference command: flashtts infer -i "text" -o output.wav -m ./models/your_model -b vllm (a batching sketch based on this command follows the list)
  • Deployment command: flashtts serve --model_path Spark-TTS-0.5B --backend vllm --llm_device cuda
  • Documentation: 📘 Documentation (linked from the README)
  • Quick start guide: installation.md, quick_start.md
  • Deployment guide: server.md
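Building on the local inference command above, the following sketch wraps the documented flashtts infer CLI to synthesize several lines of text in a loop. The model path and backend flag are the same placeholders as in that command and should be adjusted to your setup.

# Batch several texts through the documented `flashtts infer` CLI.
# MODEL_DIR and BACKEND mirror the placeholders shown in the command above.
import subprocess
from pathlib import Path

MODEL_DIR = "./models/your_model"   # placeholder model directory
BACKEND = "vllm"                    # or sglang / llama-cpp / mlx-lm / tensorrt-llm
OUT_DIR = Path("outputs")
OUT_DIR.mkdir(exist_ok=True)

texts = [
    "今天天气不错。",
    "FlashTTS 支持多种推理后端。",
]

for i, text in enumerate(texts):
    out_path = OUT_DIR / f"line_{i:03d}.wav"
    subprocess.run(
        ["flashtts", "infer", "-i", text, "-o", str(out_path), "-m", MODEL_DIR, "-b", BACKEND],
        check=True,  # raise if the CLI exits with a non-zero status
    )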

Highlighted Details

  • Supports multiple inference backends (vllm, sglang, llama-cpp, etc.) for accelerated inference.
  • Achieves a low Real-Time Factor (RTF), i.e. synthesis time divided by audio duration: 0.04 for long text with the sglang backend on an A800 GPU (a helper for computing RTF is sketched after this list).
  • Offers fine-grained control over speech parameters (pitch, speed, temperature, emotion tags).
  • Features long text synthesis with consistent voice timbre and streaming TTS for reduced latency.
  • Supports multi-character dialogue synthesis within the same text.
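The RTF figure cited above is synthesis wall-clock time divided by the duration of the generated audio, so values below 1.0 mean faster-than-real-time synthesis. The helper below only illustrates how the metric is computed; it is not part of the FlashTTS API, and the synthesize callable is a placeholder for whatever generation function you use.

# Compute Real-Time Factor (RTF) for any function that writes a WAV file.
import time
import wave

def measure_rtf(synthesize, text: str, wav_path: str) -> float:
    """Run `synthesize(text, wav_path)` and return synthesis_time / audio_duration."""
    start = time.perf_counter()
    synthesize(text, wav_path)
    elapsed = time.perf_counter() - start

    with wave.open(wav_path, "rb") as wav:
        audio_seconds = wav.getnframes() / wav.getframerate()

    return elapsed / audio_seconds   # < 1.0 means faster than real time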

Maintenance & Community

The project is associated with HuiResearch. Further community engagement details such as Discord/Slack links or a roadmap are not explicitly provided in the README.

Licensing & Compatibility

The project inherits the license from Spark-TTS. The specific license details are available in the LICENSE file. It is intended for academic research, education, and legitimate uses like accessibility, but explicitly prohibits fraudulent or illegal applications such as deepfakes.

Limitations & Caveats

  • MegaTTS 3's WaveVAE encoder is not publicly released due to security considerations; users must follow the official instructions to obtain it.
  • SparkTTS weights require bfloat16 or float32 precision; float16 will cause errors (see the dtype sketch below).
  • For extended silence in generated audio, increasing repetition_penalty is suggested.
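To illustrate the precision caveat, the generic PyTorch snippet below picks a safe dtype for the current GPU. It is not FlashTTS's own loading code, only a sketch of the bfloat16/float32 constraint.

# Pick a dtype that satisfies the SparkTTS constraint: bfloat16 or float32, never float16.
import torch

def pick_safe_dtype() -> torch.dtype:
    # Prefer bfloat16 on GPUs that support it, otherwise fall back to float32.
    if torch.cuda.is_available() and torch.cuda.is_bf16_supported():
        return torch.bfloat16
    return torch.float32

print(f"Load SparkTTS weights with dtype: {pick_safe_dtype()}")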

Health Check

  • Last commit: 2 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 2
  • Star history: 154 stars in the last 90 days

Explore Similar Projects

Starred by Georgios Konstantopoulos (CTO, General Partner at Paradigm) and Chip Huyen (author of AI Engineering and Designing Machine Learning Systems).

GPT-SoVITS by RVC-Boss

  • Few-shot voice cloning and TTS web UI
  • Top 0.5% on sourcepulse
  • 49k stars
  • Created 1 year ago, updated 1 day ago