Qwen3-TTS by QwenLM

Powerful speech generation models for diverse applications

Created 2 months ago
9,888 stars

Top 5.2% on SourcePulse

Project Summary

Qwen3-TTS is a powerful, open-source suite of Text-to-Speech models from Alibaba Cloud, enabling stable, expressive, and streaming speech generation. It targets developers and researchers seeking advanced capabilities such as free-form voice design, vivid voice cloning, and natural-language voice control across multiple languages, offering a comprehensive solution for high-fidelity speech synthesis.

How It Works

The system leverages a proprietary Qwen3-TTS-Tokenizer-12Hz for efficient acoustic compression and semantic modeling. Its core is a discrete multi-codebook language-model (LM) architecture, an end-to-end approach that bypasses traditional pipeline bottlenecks. A key innovation is the Dual-Track hybrid streaming architecture, enabling ultra-low-latency (97 ms) real-time generation. Natural-language instructions provide fine-grained control over timbre, emotion, and prosody, adapting dynamically to text semantics.
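The 12 Hz tokenizer rate admits a quick back-of-the-envelope check. A minimal sketch (the codebook count below is a hypothetical value for illustration; the summary does not state it):

```python
# Back-of-the-envelope token-rate math for a 12 Hz acoustic tokenizer.
# NUM_CODEBOOKS is a hypothetical multi-codebook depth, not a documented value.

FRAME_RATE_HZ = 12   # tokenizer emits 12 frames per second of audio
NUM_CODEBOOKS = 4    # hypothetical codebook count for illustration

def tokens_per_second(frame_rate_hz: int, num_codebooks: int) -> int:
    """Total discrete tokens the LM must produce per second of audio."""
    return frame_rate_hz * num_codebooks

def audio_ms_per_frame(frame_rate_hz: int) -> float:
    """Audio duration (in milliseconds) covered by one tokenizer frame."""
    return 1000.0 / frame_rate_hz

print(tokens_per_second(FRAME_RATE_HZ, NUM_CODEBOOKS))  # 48
print(round(audio_ms_per_frame(FRAME_RATE_HZ)))         # 83
```

At 12 Hz each frame covers roughly 83 ms of audio, so the quoted 97 ms end-to-end latency is on the order of a single tokenizer frame plus model overhead.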

Quick Start & Requirements

Installation is straightforward via pip install -U qwen-tts. For development, clone the repository and install in editable mode. Python 3.12 is recommended. GPU acceleration is essential for performance; models are typically loaded with device_map="cuda:0" and torch.bfloat16 or torch.float16. FlashAttention 2 is recommended to reduce GPU memory usage, but requires compatible hardware. Links to Hugging Face, ModelScope, Discord, and vLLM-Omni are provided.
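The two installation paths above can be sketched as shell commands. The repository URL is an assumption inferred from the QwenLM organization and project name:

```shell
# Path 1: install the released package (upgrade if already present).
pip install -U qwen-tts

# Path 2: development install in editable mode.
# Repository URL assumed from the QwenLM org name; verify before use.
git clone https://github.com/QwenLM/Qwen3-TTS.git
cd Qwen3-TTS
pip install -e .
```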

Highlighted Details

  • Supports 10 major languages and multiple dialectal voice profiles.
  • Achieves end-to-end synthesis latency as low as 97ms for real-time interactive scenarios.
  • Offers distinct models for Custom Voice, Voice Design, and rapid 3-second Voice Cloning.
  • Demonstrates competitive performance against leading TTS models in various benchmarks, including multilingual and controllable generation tasks.

Maintenance & Community

Developed by the Qwen team at Alibaba Cloud. Community support is available via a linked Discord channel and WeChat.

Licensing & Compatibility

The specific open-source license is not detailed within the provided README text. Compatibility for commercial use or closed-source linking would depend on the unstated license terms.

Limitations & Caveats

FlashAttention 2 requires supported GPU hardware and half-precision data types (float16 or bfloat16). vLLM-Omni currently supports offline inference only, with online serving planned. The web UI demo for the Base model requires HTTPS for microphone access in modern browsers.
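The hardware caveat can be made concrete with a minimal illustrative check. This helper is not part of the qwen-tts package; the >= 8.0 threshold reflects FlashAttention 2's documented Ampere-or-newer requirement:

```python
# FlashAttention 2 runs only on NVIDIA GPUs of the Ampere generation or newer
# (compute capability >= 8.0), with float16/bfloat16 inputs.
# Illustrative sketch only; not an API of the qwen-tts package.

def supports_flash_attn_2(compute_capability: tuple[int, int]) -> bool:
    """Return True if the GPU generation can run FlashAttention 2."""
    major, _minor = compute_capability
    return major >= 8

print(supports_flash_attn_2((8, 0)))  # A100 (sm80) -> True
print(supports_flash_attn_2((7, 5)))  # T4 (sm75)   -> False
```

In practice the capability tuple would come from torch.cuda.get_device_capability() on the target machine.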

Health Check

  • Last Commit: 1 week ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 10
  • Issues (30d): 28
  • Star History: 1,624 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Michael Han (Cofounder of Unsloth), and 1 more.

Orpheus-TTS by canopyai

Top 0.3% · 6k stars
Open-source TTS for human-sounding speech, built on Llama-3b
Created 1 year ago · Updated 3 months ago