faster-qwen3-tts by andimarafioti

Real-time TTS inference acceleration

Created 3 weeks ago

485 stars

Top 63.6% on SourcePulse

Project Summary

This project accelerates Qwen3-TTS, a text-to-speech model, by leveraging CUDA graphs for real-time inference. It targets developers and researchers requiring low-latency, high-throughput speech generation, offering significant performance improvements over standard PyTorch implementations without relying on external libraries like vLLM or Triton. The primary benefit is faster-than-real-time audio synthesis (a real-time factor below 1), enabling applications like live voice assistants or interactive content generation.

How It Works

The core innovation lies in capturing the entire Qwen3-TTS decode step—comprising the Talker and Code Predictor transformers—into a single torch.cuda.CUDAGraph. This eliminates the overhead of numerous small CUDA kernel launches and Python interpreter calls per step, replaying the entire sequence as one optimized GPU operation. It employs a static KV cache with padded attention to manage variable-length sequences within fixed-size tensors, contrasting with the original dynamic cache approach. This static capture and replay mechanism is key to its performance gains.
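The static-cache idea can be illustrated in plain Python. This is a simplified sketch, not the project's implementation; `StaticKVCache` and its methods are hypothetical names. The key property is that all buffers are preallocated to a fixed maximum length and a write pointer advances each step, so tensor shapes never change between decode steps and a captured CUDA graph can be replayed safely:

```python
class StaticKVCache:
    """Fixed-size KV cache: write in place instead of growing the tensors."""

    def __init__(self, max_len, dim):
        self.max_len = max_len
        # Preallocated to max_len so shapes stay constant across steps.
        self.keys = [[0.0] * dim for _ in range(max_len)]
        self.values = [[0.0] * dim for _ in range(max_len)]
        self.pos = 0  # next write position

    def append(self, k, v):
        # Overwrite the slot at self.pos rather than concatenating,
        # so no reallocation (and no shape change) ever happens.
        self.keys[self.pos] = k
        self.values[self.pos] = v
        self.pos += 1

    def attention_mask(self):
        # Padded attention: slots at or beyond self.pos are masked out,
        # which is how variable-length sequences fit fixed-size tensors.
        return [i < self.pos for i in range(self.max_len)]


cache = StaticKVCache(max_len=4, dim=2)
cache.append([1.0, 0.0], [0.5, 0.5])
cache.append([0.0, 1.0], [0.25, 0.75])
print(cache.attention_mask())  # [True, True, False, False]
```

A dynamic cache would instead concatenate a new key/value per step, changing tensor shapes each time and making graph capture impossible.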

Quick Start & Requirements

  • Installation: pip install faster-qwen3-tts
  • Prerequisites: Python 3.10+, NVIDIA GPU with CUDA.
  • Demo UI: Install with pip install -e ".[demo]" and run python demo/server.py.
  • Links: GitHub repository.

Highlighted Details

  • Achieves significant speedups (up to a 9.8x improvement in time to first audio (TTFA), reported on an RTX 4060) using CUDA graphs.
  • Supports both streaming (yielding audio chunks during generation) and non-streaming output modes.
  • Offers multiple generation modes: voice cloning (simple x-vector or advanced ICL), custom voice presets, and instruction-based voice design.
  • Includes a server mode for keeping models loaded and ready for inference.
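The streaming mode listed above can be sketched as a generator that yields fixed-size audio chunks as soon as they are decoded, rather than waiting for the full waveform. The names here are illustrative, not the library's real API:

```python
def stream_chunks(samples, chunk_size=4):
    """Yield successive chunks of decoded audio as they become ready.

    In a real streaming TTS loop each chunk would be produced by a batch
    of decode steps; here a plain list stands in for decoded PCM samples.
    """
    for start in range(0, len(samples), chunk_size):
        yield samples[start:start + chunk_size]


audio = list(range(10))  # stand-in for decoded PCM samples
chunks = list(stream_chunks(audio))
print(chunks)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

The payoff is latency: the first chunk is available after only `chunk_size` worth of decoding, which is what drives the TTFA improvements reported above.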

Maintenance & Community

No maintainer, community channel (e.g. Discord/Slack), or project roadmap details are documented.

Licensing & Compatibility

  • License: MIT.
  • Compatibility: Permissive MIT license allows for commercial use and integration into closed-source projects.

Limitations & Caveats

Numerical differences may exist between the static cache (CUDA graphs) and the original dynamic cache implementations due to varying CUDA kernel paths and reduction orders, although perceptual parity is maintained through testing. The advanced ICL voice cloning mode might exhibit minor artifacts at the start of generated speech if the reference audio ends abruptly, though a silence-padding fix is applied by default.
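The silence-padding mitigation mentioned above can be sketched as appending a short run of zero samples so the reference audio does not end abruptly. This is an illustration only; the project's actual padding length and API may differ:

```python
def pad_with_silence(samples, sample_rate, pad_ms=200):
    """Append pad_ms milliseconds of zeros to the end of the reference audio."""
    n_pad = int(sample_rate * pad_ms / 1000)
    return samples + [0.0] * n_pad


# Toy example: 3 samples at a (hypothetical) 1 kHz rate, padded by 5 ms.
ref = [0.1, -0.2, 0.3]
padded = pad_with_silence(ref, sample_rate=1000, pad_ms=5)
print(len(padded))  # 8  (3 original samples + 5 samples of silence)
```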

Health Check

  • Last Commit: 3 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 46
  • Issues (30d): 13
  • Star History: 493 stars in the last 25 days
