ComfyUI-VibeVoice by wildminder

Generate expressive, long-form, multi-speaker conversational audio within ComfyUI

Created 3 weeks ago

408 stars

Top 71.5% on SourcePulse

View on GitHub
Project Summary

This ComfyUI custom node integrates Microsoft's VibeVoice, a frontier model for expressive, long-form, multi-speaker conversational audio generation. It empowers ComfyUI users to create natural-sounding dialogue, podcasts, and audio content with consistent voices for up to four speakers, offering fine-grained control and significant VRAM savings through advanced quantization techniques.

How It Works

The node acts as a bridge, bringing VibeVoice's capabilities into the modular ComfyUI workflow. It automates model downloading and memory management, allowing users to generate speech directly from text scripts and reference audio files for zero-shot voice cloning. Key features include multi-speaker support, advanced attention mechanisms (eager, sdpa, flash_attention_2, sage), and robust 4-bit quantization for the language model, drastically reducing VRAM usage and enhancing stability.
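As a rough illustration of what 4-bit quantization plus a selectable attention backend typically looks like, here is a hedged sketch using the Hugging Face Transformers API. This is not the node's actual code; the model id is a placeholder, and the 'sage' mode requires the separate sageattention library rather than the `attn_implementation` argument shown here.

```python
# Hedged sketch, not the node's implementation: loading an LLM component
# in 4-bit with bitsandbytes and choosing an attention backend.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize LLM weights to 4 bits
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 for stability
)

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/VibeVoice-Large",   # placeholder id for illustration
    quantization_config=bnb_config,
    attn_implementation="sdpa",    # or "eager", "flash_attention_2"
    device_map="auto",
)
```

Trading `attn_implementation` off against hardware support is the usual pattern: "eager" is the most compatible, "sdpa" is a good default on recent PyTorch, and "flash_attention_2" needs a supported GPU and the flash-attn package.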

Quick Start & Requirements

  • Install: Via ComfyUI Manager (recommended) or manually clone the repository into the ComfyUI/custom_nodes/ directory.
  • Dependencies: Install required packages using pip install -r requirements.txt. bitsandbytes is necessary for quantization support. The sageattention library must be installed separately for the 'sage' attention mode.
  • Prerequisites: A working ComfyUI installation. Models are automatically downloaded on first use.
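The manual install steps above amount to a short shell sequence. This is a sketch, assuming the repository URL follows the owner/name shown at the top of this page and that you run it from your ComfyUI root:

```shell
# Manual install sketch -- run from the ComfyUI root directory.
cd custom_nodes
git clone https://github.com/wildminder/ComfyUI-VibeVoice.git
cd ComfyUI-VibeVoice

# Core dependencies; bitsandbytes enables 4-bit quantization.
pip install -r requirements.txt
pip install bitsandbytes

# Optional: only needed for the 'sage' attention mode.
pip install sageattention
```

Restart ComfyUI afterwards; models download automatically on first use.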

Highlighted Details

  • 4-Bit Quantization: Enables running the LLM component in 4-bit mode, offering substantial VRAM savings (e.g., >4.4 GB for VibeVoice-Large) and improved inference speed for larger models.
  • Attention Modes: Supports eager, sdpa, flash_attention_2, and sage attention implementations, allowing users to balance speed, memory usage, and compatibility.
  • Multi-Speaker & Voice Cloning: Generates conversations with up to 4 distinct voices using zero-shot voice cloning from any .wav or .mp3 reference audio file.
  • Script Formatting: Dialogue assignment to speakers is handled via simple text prefixes like Speaker 1: ....
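To make the script format concrete, the following is an illustrative sketch of how `Speaker N:` prefixes could be split into per-speaker turns. The function and regex are hypothetical helpers for illustration, not the node's actual parser.

```python
import re

# Matches lines of the form "Speaker 1: some dialogue" (speakers 1-4).
SPEAKER_RE = re.compile(r"^Speaker\s+([1-4]):\s*(.*)$")

def parse_script(script: str) -> list[tuple[int, str]]:
    """Split a multi-speaker script into (speaker_number, line) turns."""
    turns = []
    for line in script.strip().splitlines():
        match = SPEAKER_RE.match(line.strip())
        if match:
            turns.append((int(match.group(1)), match.group(2)))
    return turns

script = """
Speaker 1: Welcome to the show.
Speaker 2: Thanks, great to be here.
Speaker 1: Let's get started.
"""
print(parse_script(script))
```

Lines without a recognized prefix are simply skipped in this sketch; the real node may handle them differently.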

Maintenance & Community

The README carries badges for stargazers, issues, contributors, and forks, suggesting an engaged community, but it does not list dedicated community channels such as Discord or Slack.

Licensing & Compatibility

The custom node itself is distributed under the MIT License; the underlying VibeVoice model and its components remain subject to Microsoft's own licenses. The node maintains compatibility with Transformers library versions both before and after 4.56.

Limitations & Caveats

The model may exhibit emergent behaviors such as spontaneous generation of background music or attempts at singing, particularly if the reference audio contains such elements or the text prompts are suggestive. The model was not specifically trained on singing data, so results may vary.

Health Check

  • Last Commit: 1 week ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 10
  • Issues (30d): 48
  • Star History: 412 stars in the last 22 days
