DarioFT: ComfyUI nodes for advanced text-to-speech synthesis
Top 99.3% on SourcePulse
A ComfyUI custom node suite enabling advanced text-to-speech (TTS) capabilities powered by Qwen3-TTS models. It targets ComfyUI users seeking to integrate sophisticated speech generation, including custom voices, voice design, cloning, and fine-tuning, directly into their visual workflows, offering a powerful tool for content creators and researchers.
How It Works
This suite integrates Qwen3-TTS models (1.7B and 0.6B variants) as custom nodes within the ComfyUI framework. It supports generating speech using preset voices, creating novel voices via natural language descriptions (Voice Design), cloning voices from short audio samples, and fine-tuning models on custom datasets. The system prioritizes efficient model management with on-demand downloads and organized storage. For optimal performance, it leverages flash_attention_2 where available, with automatic fallback to PyTorch's sdpa.
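The attention-backend fallback described above can be sketched as follows. This is a minimal illustration, not the project's actual code; the function name is hypothetical, and the two backend strings follow the Hugging Face `attn_implementation` convention mentioned in the README.

```python
import importlib.util

def pick_attn_implementation() -> str:
    # FlashAttention 2 requires the optional flash-attn package (CUDA builds only);
    # PyTorch's built-in scaled-dot-product attention ("sdpa") is the safe fallback.
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "sdpa"
```

The returned string would typically be passed as the `attn_implementation` argument when loading the model with `transformers`' `from_pretrained`.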
Quick Start & Requirements
- Clone the repository into ComfyUI/custom_nodes and run pip install -r requirements.txt within the cloned directory. For portable ComfyUI installations, use the embedded Python interpreter for the pip command.
- The qwen-tts dependency pins transformers==4.57.3, which may conflict with other custom nodes requiring newer versions; consider using separate Python environments.
- Models are downloaded on demand and stored under ComfyUI/models/Qwen3-TTS/.
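A quick way to check whether your environment matches the transformers pin noted above is a small version probe. This is a hedged sketch (the helper name is hypothetical); the 4.57.3 pin comes from this README.

```python
from importlib.metadata import PackageNotFoundError, version

PINNED = "4.57.3"  # version required by the qwen-tts dependency per this README

def transformers_pin_ok(required: str = PINNED) -> bool:
    # False when transformers is missing or its installed version differs
    # from the pin, signalling a likely conflict with other custom nodes.
    try:
        return version("transformers") == required
    except PackageNotFoundError:
        return False
```

Running this inside the same interpreter ComfyUI uses (the embedded one, for portable installs) tells you before launch whether another custom node has pulled in an incompatible transformers release.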
Maintenance & Community
The provided README does not detail specific contributors, sponsorships, or community channels (e.g., Discord, Slack). Maintenance status and community engagement are not explicitly described.
Licensing & Compatibility
The license for this custom node is not explicitly stated in the README. Users should exercise caution regarding commercial use or integration with closed-source projects until the licensing terms are clarified.
Limitations & Caveats
Generation can hang, or the GPU can remain at 100% utilization, due to upstream Qwen3-TTS issues, particularly with long reference audio or lengthy generated outputs; workarounds include reducing max_new_tokens or shortening the reference audio. Inference on Windows may be slower without FlashAttention 2; WSL2 is recommended for better performance. Dependency conflicts with the pinned transformers version are also possible.
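The two workarounds above can be expressed as a simple clamp before generation. This is a sketch only: the function name and the specific caps (4096 tokens, 15 seconds of reference audio) are illustrative assumptions, not values documented by the project.

```python
def clamp_generation(max_new_tokens: int, ref_audio_seconds: float,
                     token_cap: int = 4096, ref_cap_seconds: float = 15.0):
    # Cap both knobs; long reference audio and very long generated outputs
    # are the reported triggers for upstream Qwen3-TTS hangs at 100% GPU.
    return min(max_new_tokens, token_cap), min(ref_audio_seconds, ref_cap_seconds)

print(clamp_generation(8192, 30.0))  # -> (4096, 15.0)
```

Applying such a clamp in a workflow keeps requests inside the range the upstream models handle reliably, at the cost of truncating very long outputs.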
Last updated 2 months ago · Inactive