ComfyUI-Qwen-TTS  by flybirdxx

Advanced ComfyUI nodes for speech synthesis and voice AI

Created 2 months ago
1,308 stars

Top 30.3% on SourcePulse

Summary

This project provides ComfyUI custom nodes for advanced speech synthesis, zero-shot voice cloning, and voice design, leveraging Alibaba's Qwen3-TTS models. It targets ComfyUI users seeking a node-based workflow for high-quality audio generation, enabling custom voice creation and realistic speech synthesis.

How It Works

The integration brings Qwen3-TTS capabilities into ComfyUI through specialized nodes for TTS, zero-shot voice cloning from short audio, and voice design from natural language descriptions. It supports efficient inference with 12Hz/25Hz tokenizers, loads models on demand and keeps them in a global cache, and lets users choose among several attention mechanisms (sage_attn, flash_attn, sdpa, eager) with auto-detection and fallback. An optional setting unloads the model after generation to free GPU memory for users with limited VRAM.
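The auto-detection and fallback behaviour described above can be sketched in a few lines. This is a minimal illustration, not the project's implementation: the preference order, the function name resolve_attention, and the importable module names (sageattention, flash_attn) are assumptions, as is treating sdpa as always available (it ships with PyTorch 2.x).

```python
from importlib.util import find_spec
from typing import Callable, Dict, List, Optional

# Backend name -> importable module that provides it (None = no extra package).
# The module names are assumptions, not taken from the project README.
BACKEND_MODULES: Dict[str, Optional[str]] = {
    "sage_attn": "sageattention",
    "flash_attn": "flash_attn",
    "sdpa": None,   # assumed built in (PyTorch 2.x)
    "eager": None,  # plain attention, always available
}

# Assumed preference order, fastest first.
PREFERENCE: List[str] = ["sage_attn", "flash_attn", "sdpa", "eager"]


def is_installed(module_name: str) -> bool:
    """Return True if the top-level module can be imported."""
    return find_spec(module_name) is not None


def resolve_attention(
    requested: str = "auto",
    probe: Callable[[str], bool] = is_installed,
) -> str:
    """Pick the requested backend, or the next available one after it."""
    if requested != "auto" and requested not in BACKEND_MODULES:
        raise ValueError(f"unknown attention backend: {requested!r}")

    # "auto" scans the whole list; an explicit request starts at that entry
    # and only falls back to slower options.
    start = 0 if requested == "auto" else PREFERENCE.index(requested)
    for name in PREFERENCE[start:]:
        module = BACKEND_MODULES[name]
        if module is None or probe(module):
            return name
    return "eager"
```

The probe argument exists only so the selection logic can be exercised without installing the optional packages; calling resolve_attention() with no arguments checks the local environment.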

Quick Start & Requirements

  • Installation: Install the base dependencies with pip install torch torchaudio transformers librosa accelerate. The optional high-performance attention mechanisms (sage_attn, flash_attn) must be installed separately.
  • Prerequisites: Python, PyTorch, Transformers library. GPU acceleration recommended.
  • Links: No direct quick-start guides or demo links provided.
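
Before loading the nodes, it can help to confirm that the base dependencies listed above are importable. The following is a small sketch of such a check; the helper name missing_packages is hypothetical and not part of the project.

```python
from importlib.util import find_spec
from typing import Callable, Iterable, List

# Base dependencies named in the installation step above.
BASE_PACKAGES = ["torch", "torchaudio", "transformers", "librosa", "accelerate"]


def missing_packages(
    names: Iterable[str],
    probe: Callable[[str], bool] = lambda n: find_spec(n) is not None,
) -> List[str]:
    """Return the subset of `names` that cannot be imported."""
    return [name for name in names if not probe(name)]


if __name__ == "__main__":
    missing = missing_packages(BASE_PACKAGES)
    if missing:
        print("Missing packages:", ", ".join(missing))
        print("Install with: pip install " + " ".join(missing))
    else:
        print("All base dependencies found.")
```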

Highlighted Details

  • Zero-shot voice cloning from 5-15s reference audio, with optional reference text for quality improvement.
  • Voice design node generates unique voices from descriptive text prompts.
  • Native support for 10 languages.
  • Selectable attention mechanism (auto, sage_attn, flash_attn, sdpa, eager) for tuning performance.
  • On-demand model loading with global caching, plus an unload_model_after_generate option for VRAM management.
  • New nodes facilitate multi-role dialogue generation with up to 8 distinct voices.

Maintenance & Community

  • The README provides no specific details on maintainers, community channels (e.g., Discord, Slack), or a public roadmap.

Licensing & Compatibility

  • License: Project code is Apache License 2.0. Model weights are governed by the Qwen3-TTS License Agreement.
  • Compatibility: Apache 2.0 is permissive for commercial use. Users must review the Qwen3-TTS License Agreement for model weights, as it may impose specific restrictions on commercial applications.

Limitations & Caveats

  • Users with limited VRAM (< 8GB) may need to enable unload_model_after_generate, potentially impacting generation speed if models are frequently swapped.
  • Optimal attention mechanism performance requires installing specific libraries (sage_attn, flash_attn); otherwise, slower built-in options are used.
  • The README does not state supported Python versions or provide hardware benchmarks.
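
The VRAM guidance above can be turned into a simple check. This sketch assumes the 8 GB figure from the first caveat as a threshold; the function names are hypothetical, and the detection step only runs if PyTorch with CUDA is present.

```python
from typing import Optional

# Threshold taken from the caveat above; treat it as a rough guide.
LOW_VRAM_THRESHOLD_GB = 8.0


def detect_total_vram_gb() -> Optional[float]:
    """Return total memory of GPU 0 in GiB, or None if it cannot be queried."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    props = torch.cuda.get_device_properties(0)
    return props.total_memory / (1024 ** 3)


def should_unload_after_generate(
    total_vram_gb: Optional[float],
    threshold_gb: float = LOW_VRAM_THRESHOLD_GB,
) -> bool:
    """Suggest enabling unload_model_after_generate on low-VRAM GPUs."""
    if total_vram_gb is None:
        # No CUDA GPU detected: nothing to free, so leave the option off.
        return False
    return total_vram_gb < threshold_gb
```

A typical call is should_unload_after_generate(detect_total_vram_gb()). Cards marketed as 8 GB often report slightly less than 8 GiB, so they may fall under the threshold.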

Health Check

  • Last commit: 1 month ago
  • Responsiveness: Inactive
  • Pull requests (30d): 1
  • Issues (30d): 6
  • Star history: 225 stars in the last 30 days
