Qwen3-TTS by QwenLM

Powerful speech generation models for diverse applications

Created 2 months ago
9,888 stars

Top 5.2% on SourcePulse

Project Summary

Qwen3-TTS is a powerful, open-source suite of Text-to-Speech models from Alibaba Cloud, enabling stable, expressive, and streaming speech generation. It targets developers and researchers seeking advanced capabilities such as free-form voice design, vivid voice cloning, and natural-language voice control across multiple languages, offering a comprehensive solution for high-fidelity speech synthesis.

How It Works

The system leverages a proprietary Qwen3-TTS-Tokenizer-12Hz for efficient acoustic compression and semantic modeling. Its core is a discrete multi-codebook language-model (LM) architecture, an end-to-end approach that bypasses traditional pipeline bottlenecks. A key innovation is the Dual-Track hybrid streaming architecture, enabling ultra-low-latency (97 ms) real-time generation. Natural-language instructions provide fine-grained control over timbre, emotion, and prosody, adapting dynamically to text semantics.
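The 12 Hz tokenizer rate admits a quick back-of-the-envelope check. A minimal sketch (the codebook count below is a hypothetical value for illustration; the summary does not state it):

```python
# Back-of-the-envelope token-rate math for a 12 Hz acoustic tokenizer.
# NUM_CODEBOOKS is a hypothetical multi-codebook depth, not a documented value.

FRAME_RATE_HZ = 12   # tokenizer emits 12 frames per second of audio
NUM_CODEBOOKS = 4    # hypothetical codebook count for illustration

def tokens_per_second(frame_rate_hz: int, num_codebooks: int) -> int:
    """Total discrete tokens the LM must produce per second of audio."""
    return frame_rate_hz * num_codebooks

def audio_ms_per_frame(frame_rate_hz: int) -> float:
    """Audio duration (in milliseconds) covered by one tokenizer frame."""
    return 1000.0 / frame_rate_hz

print(tokens_per_second(FRAME_RATE_HZ, NUM_CODEBOOKS))  # 48
print(round(audio_ms_per_frame(FRAME_RATE_HZ)))         # 83
```

At 12 Hz each frame covers roughly 83 ms of audio, so the quoted 97 ms end-to-end latency is on the order of a single tokenizer frame plus model overhead.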

Quick Start & Requirements

Installation is straightforward via pip install -U qwen-tts. For development, clone the repository and install in editable mode. Python 3.12 is recommended. GPU acceleration is essential for performance; models are typically loaded with device_map="cuda:0" and torch.bfloat16 or torch.float16. FlashAttention 2 is recommended to reduce GPU memory usage, but requires compatible hardware. Links to Hugging Face, ModelScope, Discord, and vLLM-Omni are provided.
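The two installation paths above can be sketched as shell commands. The repository URL is an assumption inferred from the QwenLM organization and project name:

```shell
# Path 1: install the released package (upgrade if already present).
pip install -U qwen-tts

# Path 2: development install in editable mode.
# Repository URL assumed from the QwenLM org name; verify before use.
git clone https://github.com/QwenLM/Qwen3-TTS.git
cd Qwen3-TTS
pip install -e .
```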

Highlighted Details

  • Supports 10 major languages and multiple dialectal voice profiles.
  • Achieves end-to-end synthesis latency as low as 97ms for real-time interactive scenarios.
  • Offers distinct models for Custom Voice, Voice Design, and rapid 3-second Voice Cloning.
  • Demonstrates competitive performance against leading TTS models in various benchmarks, including multilingual and controllable generation tasks.

Maintenance & Community

Developed by the Qwen team at Alibaba Cloud. Community support is available via a linked Discord channel and WeChat.

Licensing & Compatibility

The specific open-source license is not detailed within the provided README text. Compatibility for commercial use or closed-source linking would depend on the unstated license terms.

Limitations & Caveats

FlashAttention 2 requires supported GPU hardware and half-precision data types (float16 or bfloat16). vLLM-Omni currently supports offline inference only, with online serving planned. The web UI demo for the Base model requires HTTPS for microphone access in modern browsers.
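The hardware caveat can be made concrete with a minimal illustrative check. This helper is not part of the qwen-tts package; the >= 8.0 threshold reflects FlashAttention 2's documented Ampere-or-newer requirement:

```python
# FlashAttention 2 runs only on NVIDIA GPUs of the Ampere generation or newer
# (compute capability >= 8.0), with float16/bfloat16 inputs.
# Illustrative sketch only; not an API of the qwen-tts package.

def supports_flash_attn_2(compute_capability: tuple[int, int]) -> bool:
    """Return True if the GPU generation can run FlashAttention 2."""
    major, _minor = compute_capability
    return major >= 8

print(supports_flash_attn_2((8, 0)))  # A100 (sm80) -> True
print(supports_flash_attn_2((7, 5)))  # T4 (sm75)   -> False
```

In practice the capability tuple would come from torch.cuda.get_device_capability() on the target machine.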

Health Check

  • Last Commit: 1 week ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 10
  • Issues (30d): 28
  • Star History: 1,624 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Michael Han (Cofounder of Unsloth), and 1 more.

Orpheus-TTS by canopyai

Top 0.3% · 6k stars
Open-source TTS for human-sounding speech, built on Llama-3b
Created 1 year ago · Updated 3 months ago