ComfyUI-OmniVoice-TTS by Saganaki22

ComfyUI tool for zero-shot multilingual text-to-speech

Created 1 week ago


276 stars

Top 93.7% on SourcePulse

Project Summary

OmniVoice TTS provides advanced text-to-speech capabilities within the ComfyUI workflow, targeting users seeking high-fidelity, multilingual voice synthesis. It enables zero-shot voice cloning, custom voice design, and multi-speaker dialogue generation, offering state-of-the-art quality and extensive language support.

How It Works

This project integrates the OmniVoice TTS models into ComfyUI via custom nodes. It leverages diffusion models for synthesis, supporting over 600 languages with zero-shot voice cloning from short audio samples and voice design via text descriptions. The architecture utilizes a Qwen3 backbone, with an optional SageAttention backend for GPU-accelerated attention on compatible hardware (SM80+). Key features include fast inference (RTF as low as 0.025), support for non-verbal expression tags, and automatic model downloading.
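ComfyUI custom nodes generally follow a standard pattern: a class that declares its inputs via `INPUT_TYPES`, names an entry-point method, and is registered in a `NODE_CLASS_MAPPINGS` dictionary. The sketch below illustrates that pattern for a TTS node; the class name, parameter names, and `generate` method are hypothetical, not the repository's actual node definitions.

```python
# Minimal sketch of how a ComfyUI TTS custom node is typically structured.
# Class, method, and parameter names are illustrative, not the repo's API.

class OmniVoiceTTSNode:
    """Illustrative zero-shot TTS node: text + reference audio -> waveform."""

    @classmethod
    def INPUT_TYPES(cls):
        return {
            "required": {
                "text": ("STRING", {"multiline": True}),
                "reference_audio": ("AUDIO",),
                "language": ("STRING", {"default": "en"}),
            }
        }

    RETURN_TYPES = ("AUDIO",)
    FUNCTION = "generate"          # ComfyUI calls this method by name
    CATEGORY = "audio/tts"

    def generate(self, text, reference_audio, language):
        # A real node would run the diffusion-based synthesis here;
        # this stub just echoes its input audio.
        return (reference_audio,)


# ComfyUI discovers nodes through this mapping in the package's __init__.py.
NODE_CLASS_MAPPINGS = {"OmniVoiceTTS": OmniVoiceTTSNode}
```

In the real package, the node would load the Qwen3-backbone model (auto-downloaded on first use) and dispatch attention to SageAttention when an SM80+ GPU is detected.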

Quick Start & Requirements

  • Installation: Recommended via ComfyUI Manager (search "OmniVoice"). Manual install involves cloning the repository into ComfyUI/custom_nodes and running python install.py.
  • Prerequisites: ComfyUI, Python. Crucially, the omnivoice pip package pins torch==2.8.*, so the installer passes --no-deps to keep pip from replacing ComfyUI's existing PyTorch build and breaking GPU acceleration. A GPU is highly recommended, especially for SageAttention support (CUDA, SM80+).
  • Links: Repository: https://github.com/Saganaki22/ComfyUI-OmniVoice-TTS
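The --no-deps workaround above can be sketched as a command builder; install.py's actual logic may differ, and the helper name here is an assumption.

```python
import sys

def build_install_command(package="omnivoice", no_deps=True):
    """Build (but do not run) the pip command for installing the TTS package.

    Passing --no-deps keeps pip from pulling in the package's pinned
    torch==2.8.* requirement over ComfyUI's existing torch build.
    """
    cmd = [sys.executable, "-m", "pip", "install", package]
    if no_deps:
        cmd.append("--no-deps")
    return cmd

print(build_install_command())
# e.g. ['/usr/bin/python3', '-m', 'pip', 'install', 'omnivoice', '--no-deps']
```

The trade-off of --no-deps is that the package's other dependencies must already be satisfied by the ComfyUI environment, which is why the project ships its own install.py rather than a plain pip install.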

Highlighted Details

  • 600+ Languages: Offers the broadest language coverage among zero-shot TTS models.
  • Zero-Shot Voice Cloning: Clones voices from just 3-15 seconds of reference audio.
  • Voice Design: Synthesizes voices based on descriptive text attributes (gender, age, accent, pitch).
  • Multi-Speaker Dialogue: Generates conversations with distinct speakers using [Speaker_N]: tags.
  • Fast Inference: Achieves a Real-Time Factor (RTF) as low as 0.025, enabling near real-time generation.
  • Non-Verbal Expressions: Supports inline tags for naturalistic speech elements like [laughter] and [sigh].
  • SageAttention Support: Provides GPU-optimized attention kernels for improved performance on compatible hardware.
  • Auto-Download: Models are automatically downloaded from HuggingFace upon first use.
  • VRAM Efficiency: Features automatic CPU offloading and smart cache invalidation for reduced memory footprint.
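The [Speaker_N]: dialogue format and inline expression tags described above can be handled with a small parser. This is an illustrative sketch, not the node's actual preprocessing code; [laughter] and [sigh] come from the feature list, while the other tags in the regex are hypothetical examples.

```python
import re

SPEAKER_RE = re.compile(r"^\[Speaker_(\d+)\]:\s*(.*)$")
# [laughter] and [sigh] are documented; [breath] and [cough] are guesses
# standing in for whatever tag set the model actually supports.
NONVERBAL_RE = re.compile(r"\[(laughter|sigh|breath|cough)\]")

def parse_dialogue(script):
    """Split a multi-speaker script into (speaker_id, line) pairs."""
    turns = []
    for line in script.strip().splitlines():
        m = SPEAKER_RE.match(line.strip())
        if m:
            turns.append((int(m.group(1)), m.group(2)))
    return turns

script = """
[Speaker_1]: Hi there! [laughter]
[Speaker_2]: Good to see you.
"""
turns = parse_dialogue(script)
print(turns)  # [(1, 'Hi there! [laughter]'), (2, 'Good to see you.')]
```

For scale, an RTF of 0.025 means one second of audio is synthesized in roughly 25 ms of compute time, which is why multi-turn dialogue generation can approach real time.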

Maintenance & Community

The provided documentation does not detail specific community channels (e.g., Discord, Slack), active maintainers beyond the primary author, or a public roadmap. Credits acknowledge the original OmniVoice model authors.

Licensing & Compatibility

The ComfyUI-OmniVoice-TTS custom node is released under the Apache 2.0 License. The underlying OmniVoice model has its own separate license, which users must consult (refer to k2-fsa/OmniVoice). Apache 2.0 is generally permissive for commercial use, but the model's license may impose restrictions.

Limitations & Caveats

Installation requires careful management of Python dependencies, particularly PyTorch versions, to avoid conflicts with ComfyUI's core functionality. The high-performance SageAttention backend is restricted to GPUs with SM80+ compute capability (NVIDIA Ampere architecture or newer). The transformers library version requirement (>=5.3.0) may also conflict with other custom nodes. Dialect and style instructions are limited to a predefined set of supported values.
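The hardware and version constraints above lend themselves to a simple preflight check. The sketch below uses plain tuple and string comparisons; in a real environment the (major, minor) capability pair would come from torch.cuda.get_device_capability(), and the function names here are assumptions, not part of the project.

```python
def sm_supported(capability, minimum=(8, 0)):
    """Check whether a GPU compute capability meets the SM80+ requirement
    for the SageAttention backend (Ampere architecture or newer)."""
    return capability >= minimum

def version_at_least(version, minimum):
    """Compare dotted version strings numerically, e.g. '5.10.1' >= '5.3.0'."""
    to_tuple = lambda v: tuple(int(x) for x in v.split("."))
    return to_tuple(version) >= to_tuple(minimum)

# An RTX 3090 reports capability (8, 6); a Turing-era 2080 reports (7, 5).
print(sm_supported((8, 6)))                 # True: SageAttention eligible
print(sm_supported((7, 5)))                 # False: falls back to default attention
print(version_at_least("4.44.2", "5.3.0"))  # False: transformers would need an upgrade
```

A check like this run at node-load time would let the extension degrade gracefully (warn and fall back) instead of failing mid-workflow.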

Health Check

  • Last Commit: 2 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 1
  • Issues (30d): 21
  • Star History: 277 stars in the last 9 days

