VoxCPM by OpenBMB

Tokenizer-free TTS for context-aware speech and voice cloning

Created 10 months ago

33,959 stars

Top 1.2% on SourcePulse

1 Expert Loves This Project

Luis Capelo

Cofounder of Lightning AI

Project Summary

VoxCPM is a tokenizer-free Text-to-Speech (TTS) system for context-aware speech generation and true-to-life zero-shot voice cloning. It models speech directly in a continuous space, which bypasses the limitations of discrete tokenization. It is aimed at researchers and developers who need expressive, natural-sounding synthetic speech and voice cloning.

How It Works

VoxCPM uses an end-to-end diffusion autoregressive architecture that generates continuous speech representations directly from text. Built on the MiniCPM-4 backbone, it achieves implicit semantic-acoustic decoupling through hierarchical language modeling and Finite Scalar Quantization (FSQ) constraints. Modeling speech in a continuous space improves expressiveness and generation stability, which leads to more natural and contextually appropriate synthesis.
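To make the FSQ term concrete, here is a minimal sketch of finite scalar quantization in its generic form: each coordinate of a latent vector is squashed into a bounded range and rounded to one of a small, fixed number of integer levels. This is not VoxCPM's implementation, and how the constraint is applied inside the model is not described here. The function names, the restriction to odd level counts, and the example level counts are assumptions made for illustration.

```python
import math


def fsq_quantize(z, levels):
    """Round each coordinate of z to one of levels[i] integer values.

    Illustrative only: handles odd level counts, so the grid is symmetric
    around zero, e.g. 5 levels -> {-2, -1, 0, 1, 2}.
    """
    if len(z) != len(levels):
        raise ValueError("z and levels must have the same length")
    codes = []
    for value, n_levels in zip(z, levels):
        if n_levels < 1 or n_levels % 2 == 0:
            raise ValueError("this sketch supports odd level counts only")
        half_width = (n_levels - 1) / 2
        # tanh bounds the value to (-1, 1); scaling maps it onto the grid range.
        bounded = math.tanh(value) * half_width
        codes.append(round(bounded))
    return codes


def codebook_size(levels):
    """Number of distinct codes: the product of the per-dimension levels."""
    return math.prod(levels)


print(fsq_quantize([0.0, 10.0, -10.0], [5, 5, 5]))  # [0, 2, -2]
print(codebook_size([5, 5, 5]))                     # 125
```

Because the grid is fixed, there is no learned codebook to maintain: the set of reachable codes is determined entirely by the number of levels per dimension.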

Quick Start & Requirements

Installation: pip install voxcpm (from PyPI).
Prerequisites: A Python environment. Models download automatically or can be pre-downloaded. Additional libraries such as soundfile are required.
Hardware: Achieves a Real-Time Factor (RTF) as low as 0.17 on an NVIDIA RTX 4090 GPU for streaming synthesis.
Demo: Run python app.py to launch a Gradio web demo.
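The Real-Time Factor quoted above is the ratio of synthesis time to the duration of the audio produced, so an RTF of 0.17 means roughly 1.7 seconds of compute for 10 seconds of speech. The sketch below shows one way to measure it. It does not call the voxcpm package: `synthesize` is a hypothetical callable standing in for whatever TTS function is in use, and the sample rate is whatever that function's output uses.

```python
import time


def real_time_factor(synthesis_seconds, audio_seconds):
    """RTF = time spent synthesizing / duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def measure_rtf(synthesize, text, sample_rate):
    """Time one call to a TTS function and return its RTF.

    `synthesize` is a placeholder: any callable that takes text and
    returns a sequence of audio samples at `sample_rate` Hz.
    """
    start = time.perf_counter()
    samples = synthesize(text)
    elapsed = time.perf_counter() - start
    return real_time_factor(elapsed, len(samples) / sample_rate)


print(real_time_factor(2.0, 8.0))  # 0.25
```

An RTF below 1.0 means audio is generated faster than it plays back, which is the precondition for streaming synthesis without gaps.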

Highlighted Details

Streaming synthesis reaches a Real-Time Factor (RTF) as low as 0.17 on an NVIDIA RTX 4090.
Delivers competitive zero-shot TTS performance on benchmarks like Seed-TTS-eval and CV3-eval, with low error rates and high similarity scores.
Features context-aware generation, adapting speaking style to text content.
Enables true-to-life zero-shot voice cloning, capturing speaker nuances from short audio references.
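Zero-shot TTS benchmarks commonly report two kinds of numbers: an error rate obtained by transcribing the generated audio and comparing it with the input text, and a similarity score between speaker embeddings of the reference and generated audio. The sketch below shows the generic arithmetic behind both metrics. It assumes the transcripts and embeddings already exist, and it is not the evaluation pipeline of Seed-TTS-eval or CV3-eval, which is not described here. Both function names are illustrative.

```python
import math


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


print(word_error_rate("a b c d", "a x c d"))       # 0.25
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))   # 1.0
```

Lower error rates indicate more intelligible output, while higher similarity indicates that the cloned voice stays closer to the reference speaker.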

Maintenance & Community

Developed by ModelBest and THUHCSI.
Contact available via WeChat.
Planned updates include a technical report and support for higher sampling rates.

Licensing & Compatibility

License: Apache-2.0.
Compatibility: Intended for research and development. Production or commercial use requires rigorous testing and safety evaluation. Users must not use the model for illegal or unethical purposes or to infringe on the rights of others; AI-generated content should be clearly marked.

Limitations & Caveats

May produce unexpected, biased, or artifact-laden outputs.
Voice cloning poses a risk of misuse for deepfakes and impersonation.
Offers limited direct control over specific speech attributes such as emotion or style.
Primarily supports Chinese and English; performance on other languages is not guaranteed.
May be unstable with very long or highly expressive inputs.

Health Check

Last Commit: 2 weeks ago
Responsiveness: Inactive

Star History

2,203 stars in the last 30 days