ComfyUI-VoxCPM by wildminder

Speech synthesis and voice cloning node for ComfyUI

Created 3 weeks ago

292 stars

Top 90.3% on SourcePulse

View on GitHub
Project Summary

Summary

ComfyUI-VoxCPM integrates VoxCPM, a novel tokenizer-free Text-to-Speech (TTS) system, into the ComfyUI workflow. It enables highly expressive speech generation and true-to-life zero-shot voice cloning, offering advanced audio synthesis capabilities for researchers and power users. The node automates model management and provides fine-grained control over audio output.

How It Works

VoxCPM models speech in a continuous space using the MiniCPM-4 backbone, enabling context-aware prosody and emotional tone generation without traditional tokenization. This approach facilitates accurate voice cloning from short audio samples and high-quality zero-shot TTS. The ComfyUI node streamlines integration by handling automatic model downloads, VRAM management, and audio processing, allowing direct generation from text and optional reference audio.
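For orientation, the sketch below shows the kind of call the node wraps: zero-shot generation and voice cloning with the upstream voxcpm Python package. The package name, VoxCPM.from_pretrained, generate(), and the parameter names follow the upstream VoxCPM README and should be read as assumptions about that library, not as this node's own API; inside ComfyUI the same inputs appear as node widgets.

```python
# Minimal sketch (assumptions): zero-shot TTS / voice cloning with the upstream
# voxcpm package that this node wraps. Class, method, and parameter names follow
# the upstream VoxCPM README; the ComfyUI node exposes them as widget inputs.
import soundfile as sf
from voxcpm import VoxCPM

# Downloads the checkpoint on first use; the node instead caches it under
# ComfyUI/models/tts/VoxCPM/.
model = VoxCPM.from_pretrained("openbmb/VoxCPM-0.5B")

wav = model.generate(
    text="VoxCPM models speech in a continuous space, without discrete tokens.",
    prompt_wav_path="reference.wav",                 # optional: short clip of the voice to clone
    prompt_text="Transcript of the reference clip.", # optional: transcript of that clip
    cfg_value=2.0,            # guidance strength: higher follows text/prompt more strictly
    inference_timesteps=10,   # diffusion steps: more steps trade speed for quality
)

sf.write("output.wav", wav, 16000)  # VoxCPM outputs 16 kHz mono audio
```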

Quick Start & Requirements

  • Installation: Install via ComfyUI Manager by searching for "ComfyUI-VoxCPM", or manually clone the repository into ComfyUI/custom_nodes/ and run pip install -r requirements.txt.
  • Prerequisites: a working ComfyUI install and the Python packages listed in requirements.txt.
  • Setup: no manual model download is needed; the first use of the node automatically fetches the models to ComfyUI/models/tts/VoxCPM/.
  • Links: Usage details are within the README; no separate quick-start or demo links are provided.

Highlighted Details

  • Achieves competitive zero-shot TTS benchmark results (Seed-TTS-eval, CV3-eval) with low Word/Character Error Rates and high Similarity scores for English and Chinese.
  • Enables context-aware expressive speech and true-to-life voice cloning from brief audio samples.
  • Features automatic model management and efficient VRAM utilization within ComfyUI.
  • Supports fine-grained control over synthesis via cfg_value and inference_timesteps, and offers phoneme input for precise pronunciation (see the sketch below).
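To make the last bullet concrete, here is a short continuation of the earlier sketch showing the two synthesis knobs together with brace-delimited phoneme markup in the input text. The phoneme syntax (CMU dict entries for English, pinyin for Chinese) and the parameter names follow the upstream VoxCPM documentation and are assumptions, not guarantees about this node.

```python
# Sketch (assumptions): fine-grained control, continuing the earlier example.
# Phoneme markup and parameter names follow the upstream VoxCPM docs; the node
# exposes cfg_value and inference_timesteps as inputs on the node itself.
from voxcpm import VoxCPM

model = VoxCPM.from_pretrained("openbmb/VoxCPM-0.5B")
wav = model.generate(
    # Brace-delimited phonemes pin down one word's pronunciation
    # (CMU dict entries for English, pinyin for Chinese).
    text="The word {HH AH0 L OW1} is pronounced exactly as transcribed.",
    cfg_value=2.5,            # stricter adherence to the text and reference voice
    inference_timesteps=16,   # higher quality at the cost of generation speed
)
```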

Maintenance & Community

  • GitHub repository metrics (stars, issues, forks) are displayed via shields.io badges in the README.
  • No direct links to community channels (Discord, Slack) or a public roadmap are present in the README.
  • Acknowledgments include OpenBMB & ModelBest for VoxCPM and the ComfyUI team.

Licensing & Compatibility

  • License: Apache-2.0 for the VoxCPM model and components.
  • Compatibility: Released for research and development purposes; responsible use is emphasized.

Limitations & Caveats

The powerful voice-cloning capability carries potential for misuse, so users must adhere to ethical and legal standards. The model may become unstable on very long or complex input texts. It is trained primarily on Chinese and English, and performance on other languages is not guaranteed. The node's built-in denoiser (ZipEnhancer) has been removed to align with ComfyUI's modular philosophy.

Health Check

  • Last Commit: 3 weeks ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 8
  • Star History: 295 stars in the last 27 days
