ComfyUI-Qwen-TTS by flybirdxx

Advanced ComfyUI nodes for speech synthesis and voice AI

Created 6 days ago

664 stars

Top 50.7% on SourcePulse

Project Summary

This project provides ComfyUI custom nodes for advanced speech synthesis, zero-shot voice cloning, and voice design, leveraging Alibaba's Qwen3-TTS models. It targets ComfyUI users seeking a node-based workflow for high-quality audio generation, enabling custom voice creation and realistic speech synthesis.

How It Works

The integration exposes Qwen3-TTS in ComfyUI through specialized nodes for text-to-speech, zero-shot voice cloning from a short audio clip, and voice design from a natural-language description. It supports efficient inference with 12Hz and 25Hz tokenizers, loads models on demand and keeps them in a global cache, and lets users choose among several attention mechanisms (sage_attn, flash_attn, sdpa, eager) with auto-detection and fallback. An optional model-unloading setting frees GPU memory for users with limited VRAM.
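
The auto-detection and fallback behaviour described above can be sketched roughly as follows. This is an illustration rather than the project's actual code: the preference order, the function names, and the import names sageattention and flash_attn are assumptions.

```python
from importlib.util import find_spec

# Assumed preference order, from optional fast backends to the plain fallback.
PREFERENCE = ["sage_attn", "flash_attn", "sdpa", "eager"]

# Assumed import names for the optional backends; adjust to what is installed.
OPTIONAL_MODULES = {"sage_attn": "sageattention", "flash_attn": "flash_attn"}


def detect_available():
    """Return the set of backends that appear usable in this environment."""
    available = {"sdpa", "eager"}  # assumed to need no extra installation
    for name, module in OPTIONAL_MODULES.items():
        if find_spec(module) is not None:
            available.add(name)
    return available


def pick_attention(requested="auto", available=None):
    """Resolve a requested backend to an available one, falling back in order."""
    if available is None:
        available = detect_available()
    if requested != "auto" and requested not in PREFERENCE:
        raise ValueError(f"unknown attention backend: {requested}")
    # "auto" scans from the top; an explicit request scans from its own position.
    start = 0 if requested == "auto" else PREFERENCE.index(requested)
    for name in PREFERENCE[start:]:
        if name in available:
            return name
    return "eager"
```

Under these assumptions, requesting flash_attn on a machine without it resolves to sdpa instead of raising an error.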

Quick Start & Requirements

Installation: Install the base dependencies with pip install torch torchaudio transformers librosa accelerate. The optional high-performance attention backends (sage_attn, flash_attn) must be installed separately.
Prerequisites: Python, PyTorch, and the Transformers library. GPU acceleration is recommended.
Links: The README provides no quick-start guide or demo links.
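
Before loading the nodes, it can help to confirm that the base dependencies listed above are importable. The following is a minimal sketch; the helper name missing_packages is hypothetical and not part of the project.

```python
from importlib.util import find_spec

# Base dependencies named in the installation line above.
BASE_PACKAGES = ("torch", "torchaudio", "transformers", "librosa", "accelerate")


def missing_packages(names=BASE_PACKAGES):
    """Return the package names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing:", ", ".join(missing))
        print("Install with: pip install " + " ".join(missing))
    else:
        print("All base dependencies found.")
```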

Highlighted Details

Zero-shot voice cloning from 5-15s reference audio, with optional reference text for quality improvement.
Voice design node generates unique voices from descriptive text prompts.
Native support for 10 languages.
Flexible attention mechanism selection (auto, sage_attn, flash_attn, sdpa, eager) optimizes performance.
On-demand model loading with global caching and unload_model_after_generate option for VRAM management.
New nodes facilitate multi-role dialogue generation with up to 8 distinct voices.

Maintenance & Community

The README provides no specific details on maintainers, community channels (e.g., Discord, Slack), or a public roadmap.

Licensing & Compatibility

License: Project code is Apache License 2.0. Model weights are governed by the Qwen3-TTS License Agreement.
Compatibility: Apache 2.0 is permissive for commercial use. Users must review the Qwen3-TTS License Agreement for model weights, as it may impose specific restrictions on commercial applications.

Limitations & Caveats

Users with limited VRAM (< 8GB) may need to enable unload_model_after_generate, potentially impacting generation speed if models are frequently swapped.
Optimal attention mechanism performance requires installing specific libraries (sage_attn, flash_attn); otherwise, slower built-in options are used.
Detailed Python version requirements or specific hardware benchmarks are not explicitly stated.
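
As a rough illustration of the VRAM caveat above, a workflow could decide whether to enable unload_model_after_generate from the GPU's total memory. The 8 GB threshold comes from the text. The helper name and the interpretation of the threshold in GiB are assumptions.

```python
GIB = 1024 ** 3  # bytes per GiB


def should_unload_after_generate(total_vram_bytes, threshold_gib=8.0):
    """Suggest unloading when total VRAM is below the threshold."""
    return total_vram_bytes < threshold_gib * GIB
```

With PyTorch and a CUDA device, the total can be read from torch.cuda.get_device_properties(0).total_memory and passed to this function.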

Health Check

Last Commit: 2 days ago
Responsiveness: Inactive
Pull Requests (30d): 8
Issues (30d): 45

Star History

669 stars in the last 6 days

Explore Similar Projects

ControlSpeech by jishengpeng

Speech synthesis with simultaneous zero-shot speaker cloning and language style control

Created 1 year ago

Updated 1 year ago

SpeechGPT-2.0-preview by OpenMOSS

Real-time spoken dialogue system with GPT-4o-level capabilities

Created 1 year ago

Updated 1 year ago

ComfyUI-F5-TTS by niknah

Text-to-speech voice cloning and generation for ComfyUI

Created 1 year ago

Updated 1 month ago

ComfyUI-VoxCPM by wildminder

Speech synthesis and voice cloning node for ComfyUI

Created 4 months ago

Updated 1 month ago

ComfyUI_IndexTTS by billwuhao

High-fidelity voice cloning and dialogue generation

Created 9 months ago

Updated 2 months ago

SonicVale by xcLee001

AI voice generation platform for diverse content

Created 4 months ago

Updated 1 month ago

FireRedTTS2 by FireRedTeam

Streaming TTS for natural, long-form dialogue

Created 4 months ago

Updated 3 months ago

Starred by Tim J. Baek (Founder of Open WebUI), Gabriel Almeida (Cofounder of Langflow), and 1 more.

parler-tts by huggingface

TTS library for high-quality speech generation, based on a research paper

Created 1 year ago

Updated 1 year ago

Starred by Jiaming Song (Chief Scientist at Luma AI).

Qwen3-TTS by QwenLM

Powerful speech generation models for diverse applications

Created 1 week ago

Updated 3 days ago

Starred by Jiaming Song (Chief Scientist at Luma AI), Alex Chen (Cofounder of Nexa AI), and 1 more.

higgs-audio by boson-ai

Expressive text-to-audio generation model

Created 6 months ago

Updated 1 week ago

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Pietro Schirano (Founder of MagicPath), and 2 more.

metavoice-src by metavoiceio

TTS model for human-like, expressive speech

Created 2 years ago

Updated 1 year ago

CosyVoice by FunAudioLLM

Voice generation model for inference, training, and deployment

Created 1 year ago

Updated 1 week ago
