ComfyUI-Qwen3-TTS by DarioFT

ComfyUI nodes for advanced text-to-speech synthesis

Created 3 months ago

253 stars

Top 99.3% on SourcePulse

Project Summary

ComfyUI-Qwen3-TTS is a ComfyUI custom node suite that adds text-to-speech (TTS) powered by Qwen3-TTS models. It targets ComfyUI users, such as content creators and researchers, who want speech generation with custom voices, voice design, voice cloning, and fine-tuning directly in their visual workflows.

How It Works

This suite exposes Qwen3-TTS models (1.7B and 0.6B variants) as custom nodes within ComfyUI. It supports generating speech with preset voices, creating new voices from natural language descriptions (Voice Design), cloning voices from short audio samples, and fine-tuning models on custom datasets. Models are downloaded on demand and kept in a dedicated storage folder. For performance, it uses flash_attention_2 where available and automatically falls back to PyTorch's sdpa otherwise.
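The attention fallback described above can be sketched roughly as follows. This is a minimal illustration, not the node suite's actual detection logic: the helper name pick_attn_implementation is hypothetical, and checking only whether the flash_attn package is importable is an assumption (a real check may also need to confirm GPU and dtype support). The two returned strings are values accepted by the attn_implementation argument in Hugging Face transformers.

```python
import importlib.util


def pick_attn_implementation(prefer_flash: bool = True) -> str:
    """Return "flash_attention_2" if flash_attn is importable, else "sdpa".

    Hypothetical helper: it only tests for the package, not for hardware support.
    """
    if prefer_flash and importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    # PyTorch's scaled dot product attention is the fallback.
    return "sdpa"


attn_implementation = pick_attn_implementation()
print(f"Using attention implementation: {attn_implementation}")
```

The resulting string would typically be passed as attn_implementation when loading a model with transformers.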

Quick Start & Requirements

Installation: Clone the repository into ComfyUI/custom_nodes and run pip install -r requirements.txt within the cloned directory. For portable ComfyUI installations, use the embedded Python interpreter for the pip command.
Prerequisites: A CUDA-compatible PyTorch installation is required for GPU acceleration. Note that the qwen-tts dependency mandates transformers==4.57.3, which may conflict with other custom nodes requiring newer versions; consider using separate Python environments.
Model Storage: Models are automatically downloaded and stored in ComfyUI/models/Qwen3-TTS/.
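Because the transformers pin can clash with other custom nodes, it may help to check the active environment before installing. The sketch below is an illustration only: the helper names installed_version and check_pin are hypothetical, and the comparison is a plain string match against the version quoted above.

```python
from importlib import metadata
from typing import Optional

# Pin reported above for the qwen-tts dependency.
REQUIRED_TRANSFORMERS = "4.57.3"


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_pin(installed: Optional[str], required: str = REQUIRED_TRANSFORMERS) -> str:
    """Classify the environment as "missing", "ok", or "conflict"."""
    if installed is None:
        return "missing"
    return "ok" if installed == required else "conflict"


status = check_pin(installed_version("transformers"))
print(f"transformers pin status: {status}")
```

A "conflict" result suggests that installing the requirements would change the transformers version other nodes rely on, which is the case where a separate Python environment is worth considering.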

Highlighted Details

Comprehensive Qwen3-TTS features: Custom Voice, Voice Design, Voice Cloning, and Fine-Tuning.
Extensive cross-lingual support for Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, and Italian.
Fine-tuning optimizations include gradient checkpointing, 8-bit AdamW, and per-epoch checkpointing with automatic cleanup.
A prompt caching mechanism saves and reuses voice embeddings for faster voice cloning workflows.
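The prompt caching idea can be sketched as a small on-disk cache keyed by a hash of the reference audio. This is a generic illustration under stated assumptions, not the node suite's implementation: get_or_compute_prompt, cache_key, the pickle file format, and the compute callback (standing in for whatever produces the voice embedding) are all hypothetical.

```python
import hashlib
import pickle
from pathlib import Path
from typing import Any, Callable


def cache_key(audio_bytes: bytes) -> str:
    """Derive a stable key from the reference audio content."""
    return hashlib.sha256(audio_bytes).hexdigest()


def get_or_compute_prompt(
    cache_dir: Path,
    audio_bytes: bytes,
    compute: Callable[[bytes], Any],
) -> Any:
    """Load a cached voice prompt if present; otherwise compute and store it."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / f"{cache_key(audio_bytes)}.pkl"
    if path.exists():
        with path.open("rb") as f:
            return pickle.load(f)
    prompt = compute(audio_bytes)  # hypothetical, expensive embedding step
    with path.open("wb") as f:
        pickle.dump(prompt, f)
    return prompt
```

Any other input that changes the embedding (for example a reference transcript or model variant) would need to be folded into the key as well. Pickle files should only be loaded from trusted locations.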

Maintenance & Community

The provided README does not detail specific contributors, sponsorships, or community channels (e.g., Discord, Slack). Maintenance status and community engagement are not explicitly described.

Licensing & Compatibility

The license for this custom node is not explicitly stated in the README. Users should exercise caution regarding commercial use or integration with closed-source projects until the licensing terms are clarified.

Limitations & Caveats

Generation can hang, or the GPU can remain at 100% utilization, because of upstream Qwen3-TTS issues, particularly with long reference audio or long generated outputs; the suggested workarounds are to reduce max_new_tokens or shorten the reference audio. Inference on Windows may be slower without FlashAttention 2, and WSL2 is recommended for better performance. The pinned transformers version may also conflict with other custom nodes.
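The two workarounds (a shorter reference clip and a lower max_new_tokens) can be sketched as simple guards applied before generation. The helper names trim_reference and cap_max_new_tokens are hypothetical, and the limits used in the example call are placeholders rather than values recommended by the project.

```python
from typing import Sequence


def trim_reference(samples: Sequence[float], sample_rate: int, max_seconds: float) -> Sequence[float]:
    """Keep at most max_seconds of audio from the start of the reference clip."""
    if sample_rate <= 0 or max_seconds <= 0:
        raise ValueError("sample_rate and max_seconds must be positive")
    max_samples = int(sample_rate * max_seconds)
    return samples[:max_samples]


def cap_max_new_tokens(requested: int, ceiling: int) -> int:
    """Clamp the requested generation length to a chosen ceiling."""
    if requested <= 0 or ceiling <= 0:
        raise ValueError("requested and ceiling must be positive")
    return min(requested, ceiling)


# Placeholder limits for illustration only.
clip = trim_reference([0.0] * 48000, sample_rate=16000, max_seconds=2.0)
tokens = cap_max_new_tokens(requested=8192, ceiling=2048)
print(len(clip), tokens)
```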

Health Check

Last Commit

2 months ago

Responsiveness

Inactive

Star History

12 stars in the last 30 days