ComfyUI-VibeVoice by wildminder

Generate expressive, long-form, multi-speaker conversational audio within ComfyUI

Created 3 weeks ago

408 stars

Top 71.5% on SourcePulse

View on GitHub
Project Summary

This ComfyUI custom node integrates Microsoft's VibeVoice, a frontier model for expressive, long-form, multi-speaker conversational audio generation. It empowers ComfyUI users to create natural-sounding dialogue, podcasts, and audio content with consistent voices for up to four speakers, offering fine-grained control and significant VRAM savings through advanced quantization techniques.

How It Works

The node acts as a bridge, bringing VibeVoice's capabilities into the modular ComfyUI workflow. It automates model downloading and memory management, allowing users to generate speech directly from text scripts and reference audio files for zero-shot voice cloning. Key features include multi-speaker support, advanced attention mechanisms (eager, sdpa, flash_attention_2, sage), and robust 4-bit quantization for the language model, drastically reducing VRAM usage and enhancing stability.
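As a rough illustration of what 4-bit quantization plus a selectable attention backend typically looks like, here is a hedged sketch using the Hugging Face Transformers API. This is not the node's actual code; the model id is a placeholder, and the 'sage' mode requires the separate sageattention library rather than the `attn_implementation` argument shown here.

```python
# Hedged sketch, not the node's implementation: loading an LLM component
# in 4-bit with bitsandbytes and choosing an attention backend.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize LLM weights to 4 bits
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 for stability
)

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/VibeVoice-Large",   # placeholder id for illustration
    quantization_config=bnb_config,
    attn_implementation="sdpa",    # or "eager", "flash_attention_2"
    device_map="auto",
)
```

Trading `attn_implementation` off against hardware support is the usual pattern: "eager" is the most compatible, "sdpa" is a good default on recent PyTorch, and "flash_attention_2" needs a supported GPU and the flash-attn package.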

Quick Start & Requirements

  • Install: Via ComfyUI Manager (recommended) or manually clone the repository into the ComfyUI/custom_nodes/ directory.
  • Dependencies: Install required packages using pip install -r requirements.txt. bitsandbytes is necessary for quantization support. The sageattention library must be installed separately for the 'sage' attention mode.
  • Prerequisites: A working ComfyUI installation. Models are automatically downloaded on first use.
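The manual install steps above amount to a short shell sequence. This is a sketch, assuming the repository URL follows the owner/name shown at the top of this page and that you run it from your ComfyUI root:

```shell
# Manual install sketch -- run from the ComfyUI root directory.
cd custom_nodes
git clone https://github.com/wildminder/ComfyUI-VibeVoice.git
cd ComfyUI-VibeVoice

# Core dependencies; bitsandbytes enables 4-bit quantization.
pip install -r requirements.txt
pip install bitsandbytes

# Optional: only needed for the 'sage' attention mode.
pip install sageattention
```

Restart ComfyUI afterwards; models download automatically on first use.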

Highlighted Details

  • 4-Bit Quantization: Enables running the LLM component in 4-bit mode, offering substantial VRAM savings (e.g., >4.4 GB for VibeVoice-Large) and improved inference speed for larger models.
  • Attention Modes: Supports eager, sdpa, flash_attention_2, and sage attention implementations, allowing users to balance speed, memory usage, and compatibility.
  • Multi-Speaker & Voice Cloning: Generates conversations with up to 4 distinct voices using zero-shot voice cloning from any .wav or .mp3 reference audio file.
  • Script Formatting: Dialogue assignment to speakers is handled via simple text prefixes like Speaker 1: ....
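To make the script format concrete, the following is an illustrative sketch of how `Speaker N:` prefixes could be split into per-speaker turns. The function and regex are hypothetical helpers for illustration, not the node's actual parser.

```python
import re

# Matches lines of the form "Speaker 1: some dialogue" (speakers 1-4).
SPEAKER_RE = re.compile(r"^Speaker\s+([1-4]):\s*(.*)$")

def parse_script(script: str) -> list[tuple[int, str]]:
    """Split a multi-speaker script into (speaker_number, line) turns."""
    turns = []
    for line in script.strip().splitlines():
        match = SPEAKER_RE.match(line.strip())
        if match:
            turns.append((int(match.group(1)), match.group(2)))
    return turns

script = """
Speaker 1: Welcome to the show.
Speaker 2: Thanks, great to be here.
Speaker 1: Let's get started.
"""
print(parse_script(script))
```

Lines without a recognized prefix are simply skipped in this sketch; the real node may handle them differently.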

Maintenance & Community

The README carries badges for stargazers, issues, contributors, and forks, suggesting an engaged community, but it does not list dedicated community channels such as Discord or Slack.

Licensing & Compatibility

The custom node itself is distributed under the MIT License; the underlying VibeVoice model and its components remain subject to Microsoft's own licenses. The node maintains compatibility with Transformers library versions both before and after 4.56.

Limitations & Caveats

The model may exhibit emergent behaviors such as spontaneous generation of background music or attempts at singing, particularly if the reference audio contains such elements or the text prompts are suggestive. The model was not specifically trained on singing data, so results may vary.

Health Check

  • Last Commit: 1 week ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 10
  • Issues (30d): 48
  • Star History: 412 stars in the last 22 days
