ComfyUI-VoxCPM by wildminder

Speech synthesis and voice cloning node for ComfyUI

Created 3 weeks ago

292 stars

Top 90.3% on SourcePulse

View on GitHub
Project Summary

Summary

ComfyUI-VoxCPM integrates VoxCPM, a novel tokenizer-free Text-to-Speech (TTS) system, into the ComfyUI workflow. It enables highly expressive speech generation and true-to-life zero-shot voice cloning, offering advanced audio synthesis capabilities for researchers and power users. The node automates model management and provides fine-grained control over audio output.

How It Works

VoxCPM models speech in a continuous space using the MiniCPM-4 backbone, enabling context-aware prosody and emotional tone generation without traditional tokenization. This approach facilitates accurate voice cloning from short audio samples and high-quality zero-shot TTS. The ComfyUI node streamlines integration by handling automatic model downloads, VRAM management, and audio processing, allowing direct generation from text and optional reference audio.
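For orientation, the sketch below shows the kind of call the node wraps: zero-shot generation and voice cloning with the upstream voxcpm Python package. The package name, VoxCPM.from_pretrained, generate(), and the parameter names follow the upstream VoxCPM README and should be read as assumptions about that library, not as this node's own API; inside ComfyUI the same inputs appear as node widgets.

```python
# Minimal sketch (assumptions): zero-shot TTS / voice cloning with the upstream
# voxcpm package that this node wraps. Class, method, and parameter names follow
# the upstream VoxCPM README; the ComfyUI node exposes them as widget inputs.
import soundfile as sf
from voxcpm import VoxCPM

# Downloads the checkpoint on first use; the node instead caches it under
# ComfyUI/models/tts/VoxCPM/.
model = VoxCPM.from_pretrained("openbmb/VoxCPM-0.5B")

wav = model.generate(
    text="VoxCPM models speech in a continuous space, without discrete tokens.",
    prompt_wav_path="reference.wav",                 # optional: short clip of the voice to clone
    prompt_text="Transcript of the reference clip.", # optional: transcript of that clip
    cfg_value=2.0,            # guidance strength: higher follows text/prompt more strictly
    inference_timesteps=10,   # diffusion steps: more steps trade speed for quality
)

sf.write("output.wav", wav, 16000)  # VoxCPM outputs 16 kHz mono audio
```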

Quick Start & Requirements

  • Installation: Install via ComfyUI Manager by searching for "ComfyUI-VoxCPM", or manually clone the repository into ComfyUI/custom_nodes/ and run pip install -r requirements.txt.
  • Prerequisites: a working ComfyUI install and the Python packages listed in requirements.txt.
  • Setup: no manual model download is needed; the first use of the node automatically fetches the models to ComfyUI/models/tts/VoxCPM/.
  • Links: Usage details are within the README; no separate quick-start or demo links are provided.

Highlighted Details

  • Achieves competitive zero-shot TTS benchmark results (Seed-TTS-eval, CV3-eval) with low Word/Character Error Rates and high Similarity scores for English and Chinese.
  • Enables context-aware expressive speech and true-to-life voice cloning from brief audio samples.
  • Features automatic model management and efficient VRAM utilization within ComfyUI.
  • Supports fine-grained control over synthesis via cfg_value and inference_timesteps, and offers phoneme input for precise pronunciation (see the sketch below).
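To make the last bullet concrete, here is a short continuation of the earlier sketch showing the two synthesis knobs together with brace-delimited phoneme markup in the input text. The phoneme syntax (CMU dict entries for English, pinyin for Chinese) and the parameter names follow the upstream VoxCPM documentation and are assumptions, not guarantees about this node.

```python
# Sketch (assumptions): fine-grained control, continuing the earlier example.
# Phoneme markup and parameter names follow the upstream VoxCPM docs; the node
# exposes cfg_value and inference_timesteps as inputs on the node itself.
from voxcpm import VoxCPM

model = VoxCPM.from_pretrained("openbmb/VoxCPM-0.5B")
wav = model.generate(
    # Brace-delimited phonemes pin down one word's pronunciation
    # (CMU dict entries for English, pinyin for Chinese).
    text="The word {HH AH0 L OW1} is pronounced exactly as transcribed.",
    cfg_value=2.5,            # stricter adherence to the text and reference voice
    inference_timesteps=16,   # higher quality at the cost of generation speed
)
```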

Maintenance & Community

  • GitHub repository metrics (stars, issues, forks) are displayed via shields.io badges in the README.
  • No direct links to community channels (Discord, Slack) or a public roadmap are present in the README.
  • Acknowledgments include OpenBMB & ModelBest for VoxCPM and the ComfyUI team.

Licensing & Compatibility

  • License: Apache-2.0 for the VoxCPM model and components.
  • Compatibility: Released for research and development purposes; responsible use is emphasized.

Limitations & Caveats

The powerful voice-cloning capability carries potential for misuse, so users must adhere to ethical and legal standards. The model may become unstable on very long or complex input texts. It is trained primarily on Chinese and English, and performance on other languages is not guaranteed. The node's built-in denoiser (ZipEnhancer) has been removed to align with ComfyUI's modular philosophy.

Health Check

  • Last Commit: 3 weeks ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 8
  • Star History: 295 stars in the last 27 days
