faster-qwen3-tts by andimarafioti

Real-time TTS inference acceleration

Created 3 weeks ago

485 stars

Top 63.6% on SourcePulse

Project Summary

This project accelerates Qwen3-TTS, a text-to-speech model, by leveraging CUDA graphs for real-time inference. It targets developers and researchers requiring low-latency, high-throughput speech generation, offering significant performance improvements over standard PyTorch implementations without relying on external libraries like vLLM or Triton. The primary benefit is faster-than-real-time audio synthesis (a real-time factor below 1), enabling applications like live voice assistants or interactive content generation.

How It Works

The core innovation lies in capturing the entire Qwen3-TTS decode step—comprising the Talker and Code Predictor transformers—into a single torch.cuda.CUDAGraph. This eliminates the overhead of numerous small CUDA kernel launches and Python interpreter calls per step, replaying the entire sequence as one optimized GPU operation. It employs a static KV cache with padded attention to manage variable-length sequences within fixed-size tensors, contrasting with the original dynamic cache approach. This static capture and replay mechanism is key to its performance gains.
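The static-cache idea can be illustrated in plain Python. This is a simplified sketch, not the project's implementation; `StaticKVCache` and its methods are hypothetical names. The key property is that all buffers are preallocated to a fixed maximum length and a write pointer advances each step, so tensor shapes never change between decode steps and a captured CUDA graph can be replayed safely:

```python
class StaticKVCache:
    """Fixed-size KV cache: write in place instead of growing the tensors."""

    def __init__(self, max_len, dim):
        self.max_len = max_len
        # Preallocated to max_len so shapes stay constant across steps.
        self.keys = [[0.0] * dim for _ in range(max_len)]
        self.values = [[0.0] * dim for _ in range(max_len)]
        self.pos = 0  # next write position

    def append(self, k, v):
        # Overwrite the slot at self.pos rather than concatenating,
        # so no reallocation (and no shape change) ever happens.
        self.keys[self.pos] = k
        self.values[self.pos] = v
        self.pos += 1

    def attention_mask(self):
        # Padded attention: slots at or beyond self.pos are masked out,
        # which is how variable-length sequences fit fixed-size tensors.
        return [i < self.pos for i in range(self.max_len)]


cache = StaticKVCache(max_len=4, dim=2)
cache.append([1.0, 0.0], [0.5, 0.5])
cache.append([0.0, 1.0], [0.25, 0.75])
print(cache.attention_mask())  # [True, True, False, False]
```

A dynamic cache would instead concatenate a new key/value per step, changing tensor shapes each time and making graph capture impossible.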

Quick Start & Requirements

  • Installation: pip install faster-qwen3-tts
  • Prerequisites: Python 3.10+, NVIDIA GPU with CUDA.
  • Demo UI: Install with pip install -e ".[demo]" and run python demo/server.py.
  • Links: GitHub repository.

Highlighted Details

  • Achieves significant speedups (up to a 9.8x improvement in time to first audio (TTFA), reported on an RTX 4060) using CUDA graphs.
  • Supports both streaming (yielding audio chunks during generation) and non-streaming output modes.
  • Offers multiple generation modes: voice cloning (simple x-vector or advanced ICL), custom voice presets, and instruction-based voice design.
  • Includes a server mode for keeping models loaded and ready for inference.
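The streaming mode listed above can be sketched as a generator that yields fixed-size audio chunks as soon as they are decoded, rather than waiting for the full waveform. The names here are illustrative, not the library's real API:

```python
def stream_chunks(samples, chunk_size=4):
    """Yield successive chunks of decoded audio as they become ready.

    In a real streaming TTS loop each chunk would be produced by a batch
    of decode steps; here a plain list stands in for decoded PCM samples.
    """
    for start in range(0, len(samples), chunk_size):
        yield samples[start:start + chunk_size]


audio = list(range(10))  # stand-in for decoded PCM samples
chunks = list(stream_chunks(audio))
print(chunks)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

The payoff is latency: the first chunk is available after only `chunk_size` worth of decoding, which is what drives the TTFA improvements reported above.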

Maintenance & Community

No maintainer, community channel (e.g. Discord/Slack), or project roadmap details are documented.

Licensing & Compatibility

  • License: MIT.
  • Compatibility: Permissive MIT license allows for commercial use and integration into closed-source projects.

Limitations & Caveats

Numerical differences may exist between the static cache (CUDA graphs) and the original dynamic cache implementations due to varying CUDA kernel paths and reduction orders, although perceptual parity is maintained through testing. The advanced ICL voice cloning mode might exhibit minor artifacts at the start of generated speech if the reference audio ends abruptly, though a silence-padding fix is applied by default.
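The silence-padding mitigation mentioned above can be sketched as appending a short run of zero samples so the reference audio does not end abruptly. This is an illustration only; the project's actual padding length and API may differ:

```python
def pad_with_silence(samples, sample_rate, pad_ms=200):
    """Append pad_ms milliseconds of zeros to the end of the reference audio."""
    n_pad = int(sample_rate * pad_ms / 1000)
    return samples + [0.0] * n_pad


# Toy example: 3 samples at a (hypothetical) 1 kHz rate, padded by 5 ms.
ref = [0.1, -0.2, 0.3]
padded = pad_with_silence(ref, sample_rate=1000, pad_ms=5)
print(len(padded))  # 8  (3 original samples + 5 samples of silence)
```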

Health Check

  • Last Commit: 3 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 46
  • Issues (30d): 13
  • Star History: 493 stars in the last 25 days
