ComfyUI-Qwen-TTS by flybirdxx

Advanced ComfyUI nodes for speech synthesis and voice AI

Created 6 days ago

664 stars

Top 50.7% on SourcePulse

View on GitHub
Project Summary

Summary

This project provides ComfyUI custom nodes for advanced speech synthesis, zero-shot voice cloning, and voice design, leveraging Alibaba's Qwen3-TTS models. It targets ComfyUI users seeking a node-based workflow for high-quality audio generation, enabling custom voice creation and realistic speech synthesis.

How It Works

The integration brings Qwen3-TTS capabilities into ComfyUI via specialized nodes for TTS, zero-shot voice cloning from short audio, and voice design from natural language descriptions. It supports efficient inference with 12 Hz/25 Hz tokenizers, features on-demand model loading with global caching, and allows selection from multiple attention mechanisms (sage_attn, flash_attn, sdpa, eager) with auto-detection and fallback. An optional model unloading feature manages GPU memory for users with limited VRAM.
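
A minimal sketch of that loading pattern, assuming a hypothetical loader callback rather than the node pack's actual classes: models sit in a process-wide cache, load on first use, and can be dropped again when unload_model_after_generate is enabled.

```python
# Minimal sketch of on-demand loading with a global cache and optional unload.
# The cache key and loader callback are illustrative, not the node pack's real API.
import gc
import torch

_MODEL_CACHE = {}  # model_id -> loaded model, shared across node executions

def get_model(model_id, loader):
    """Load the model on first use and reuse it afterwards."""
    if model_id not in _MODEL_CACHE:
        _MODEL_CACHE[model_id] = loader(model_id)
    return _MODEL_CACHE[model_id]

def unload_model(model_id):
    """Drop the cached model and reclaim VRAM (unload_model_after_generate)."""
    if _MODEL_CACHE.pop(model_id, None) is not None:
        gc.collect()
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
```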

Quick Start & Requirements

  • Installation: Base dependencies: pip install torch torchaudio transformers librosa accelerate. The optional attention backends used for faster inference (sage_attn, flash_attn) require separate installation; a detection sketch follows this list.
  • Prerequisites: Python, PyTorch, Transformers library. GPU acceleration recommended.
  • Links: No direct quick-start guides or demo links provided.
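
The optional backends can be probed before picking a mechanism. The sketch below is an assumption about how such a check might look; the probed module names (sageattention, flash_attn) refer to the usual PyPI packages and are not taken from this README.

```python
# Sketch: choose the best available attention backend, falling back to PyTorch's
# built-in scaled-dot-product attention. Probed module names are assumptions.
from importlib.util import find_spec

def pick_attention():
    if find_spec("sageattention") is not None:
        return "sage_attn"
    if find_spec("flash_attn") is not None:
        return "flash_attn"
    return "sdpa"  # built into recent PyTorch, no extra install needed

print(pick_attention())
```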

Highlighted Details

  • Zero-shot voice cloning from 5-15 s reference audio, with optional reference text to improve quality (see the reference-check sketch after this list).
  • Voice design node generates unique voices from descriptive text prompts.
  • Native support for 10 languages.
  • Flexible attention mechanism selection (auto, sage_attn, flash_attn, sdpa, eager) optimizes performance.
  • On-demand model loading with global caching and unload_model_after_generate option for VRAM management.
  • New nodes facilitate multi-role dialogue generation with up to 8 distinct voices.
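
Because the cloning node expects a 5-15 s reference clip, a quick length check with librosa (already a base dependency) can catch unusable inputs before generation; the file path below is illustrative.

```python
# Sketch: verify a cloning reference clip is within the 5-15 s window.
# The file path is illustrative; librosa ships with the base dependencies.
import librosa

def load_reference(path, min_s=5.0, max_s=15.0):
    audio, sr = librosa.load(path, sr=None, mono=True)  # keep native sample rate
    duration = len(audio) / sr
    if not min_s <= duration <= max_s:
        raise ValueError(f"Reference is {duration:.1f}s; expected {min_s}-{max_s}s")
    return audio, sr

audio, sr = load_reference("my_reference_voice.wav")
```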

Maintenance & Community

  • The README provides no specific details on maintainers, community channels (e.g., Discord, Slack), or a public roadmap.

Licensing & Compatibility

  • License: Project code is Apache License 2.0. Model weights are governed by the Qwen3-TTS License Agreement.
  • Compatibility: Apache 2.0 is permissive for commercial use. Users must review the Qwen3-TTS License Agreement for model weights, as it may impose specific restrictions on commercial applications.

Limitations & Caveats

  • Users with limited VRAM (< 8GB) may need to enable unload_model_after_generate, potentially impacting generation speed if models are frequently swapped.
  • Optimal attention mechanism performance requires installing specific libraries (sage_attn, flash_attn); otherwise, slower built-in options are used.
  • Detailed Python version requirements or specific hardware benchmarks are not explicitly stated.

Health Check

  • Last Commit: 2 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 8
  • Issues (30d): 45

Star History

  • 669 stars in the last 6 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering", "Designing Machine Learning Systems"), Pietro Schirano (founder of MagicPath), and 2 more.

  • metavoice-src by metavoiceio: TTS model for human-like, expressive speech. Top 0.1%, 4k stars. Created 2 years ago, updated 1 year ago.