Generate expressive, long-form, multi-speaker conversational audio within ComfyUI
This ComfyUI custom node integrates Microsoft's VibeVoice, a frontier model for expressive, long-form, multi-speaker conversational audio generation. It empowers ComfyUI users to create natural-sounding dialogue, podcasts, and audio content with consistent voices for up to four speakers, offering fine-grained control and significant VRAM savings through advanced quantization techniques.
How It Works
The node acts as a bridge, bringing VibeVoice's capabilities into the modular ComfyUI workflow. It automates model downloading and memory management, allowing users to generate speech directly from text scripts and reference audio files for zero-shot voice cloning. Key features include multi-speaker support, advanced attention mechanisms (eager, sdpa, flash_attention_2, sage), and robust 4-bit quantization for the language model, drastically reducing VRAM usage and enhancing stability.
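As a rough sketch of the quantization side, the snippet below shows how 4-bit NF4 loading is typically wired up through Transformers and bitsandbytes. The checkpoint id, model class, and parameter choices are illustrative assumptions, not code taken from this node:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 quantization via bitsandbytes -- the standard Transformers
# mechanism that "4-bit quantization for the language model" refers to.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

# Assumed checkpoint id and model class; the node downloads the actual
# VibeVoice weights automatically and may use a different entry point.
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/VibeVoice-1.5B",           # assumption, not verified
    quantization_config=bnb_config,
    attn_implementation="sdpa",           # or "eager" / "flash_attention_2"
    device_map="auto",
)
```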
Quick Start & Requirements
Clone the repository into the ComfyUI/custom_nodes/ directory and install the dependencies with pip install -r requirements.txt. The bitsandbytes package is necessary for quantization support, and the sageattention library must be installed separately to use the 'sage' attention mode.
Highlighted Details
- Supports the eager, sdpa, flash_attention_2, and sage attention implementations, allowing users to balance speed, memory usage, and compatibility.
- Zero-shot voice cloning from a .wav or .mp3 reference audio file.
- Multi-speaker scripts assign lines with the Speaker 1: ... prefix format (a formatted example follows this list).
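To make the script and voice-cloning inputs concrete, here is a minimal sketch of the Speaker 1: ... line format with the maximum of four speakers; the exact parsing tolerances and the file paths shown are assumptions:

```python
# A four-speaker script in the "Speaker N:" line format the node parses.
script = """\
Speaker 1: Welcome back to the show. Today we're talking about open-source TTS.
Speaker 2: Thanks for having me. VibeVoice has been fun to experiment with.
Speaker 3: I'm curious how voice cloning works from a single reference clip.
Speaker 4: Same here. Let's get into it.
"""

# Each speaker is paired with a .wav or .mp3 reference clip for zero-shot
# cloning; these paths are illustrative placeholders.
speaker_voices = [
    "voices/host.wav",
    "voices/guest.mp3",
    "voices/cohost.wav",
    "voices/producer.wav",
]
```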
Maintenance & Community
The repository displays badges for stargazers, issues, contributors, and forks, indicating some community engagement. No dedicated community channels such as Discord or Slack are mentioned in the README.
Licensing & Compatibility
The custom node itself is distributed under the MIT License; the underlying VibeVoice model and its components remain subject to the licenses provided by Microsoft. The node maintains compatibility with Transformers library versions both before and after 4.56.
Limitations & Caveats
The model may exhibit emergent behaviors such as spontaneously generating background music or attempting to sing, particularly when the reference audio contains such elements or the text prompt suggests them. The model was not specifically trained on singing data, so results may vary.