VoxCPM  by OpenBMB

Tokenizer-free TTS for context-aware speech and voice cloning

Created 2 months ago
2,203 stars

Top 20.4% on SourcePulse

Project Summary

VoxCPM is a tokenizer-free Text-to-Speech (TTS) system for context-aware speech generation and zero-shot voice cloning. It models speech directly in a continuous space rather than as discrete tokens, avoiding the limitations of discrete tokenization. It targets researchers and developers who need expressive, natural-sounding synthetic speech and voice cloning from short reference audio.

How It Works

VoxCPM uses an end-to-end diffusion-autoregressive architecture that generates continuous speech representations directly from text. Built on the MiniCPM-4 backbone, it achieves implicit semantic-acoustic decoupling through hierarchical language modeling and Finite Scalar Quantization (FSQ) constraints. Modeling in a continuous space is intended to improve expressiveness and generation stability, yielding more natural and contextually appropriate speech.
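
To make the FSQ constraint mentioned above more concrete: FSQ is generally understood as finite scalar quantization, where each latent dimension is bounded and then snapped to a small fixed grid of values. The following is a minimal generic sketch of that idea in plain Python, not VoxCPM's implementation. The function name `fsq_quantize` and the level counts are hypothetical, and it shows only the forward pass (training setups typically pair it with a straight-through gradient estimator).

```python
import math


def fsq_quantize(latent, levels):
    """Snap each latent dimension to one of `levels[i]` evenly spaced values in [-1, 1].

    Returns (codes, indices): the quantized values and their integer grid positions.
    Generic illustration of finite scalar quantization; not taken from VoxCPM.
    """
    if len(latent) != len(levels):
        raise ValueError("latent and levels must have the same length")

    codes, indices = [], []
    for value, n_levels in zip(latent, levels):
        if n_levels < 2:
            raise ValueError("each dimension needs at least 2 levels")
        bounded = math.tanh(value)  # squash the raw value into (-1, 1)
        # Map (-1, 1) onto grid positions 0 .. n_levels - 1 and round to the nearest.
        index = round((bounded + 1.0) / 2.0 * (n_levels - 1))
        # Map the grid position back into [-1, 1].
        codes.append(index / (n_levels - 1) * 2.0 - 1.0)
        indices.append(index)
    return codes, indices


# Hypothetical 3-dimensional latent with 5 levels per dimension.
codes, indices = fsq_quantize([0.0, 10.0, -10.0], [5, 5, 5])
```

Because every dimension can take only a handful of values, the quantized vector acts as a coarse bottleneck, which is one plausible reading of how such a constraint could separate stable, semantic-level structure from fine acoustic detail.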

Quick Start & Requirements

  • Installation: Install from PyPI with pip install voxcpm.
  • Prerequisites: A Python environment. Model weights download automatically or can be pre-downloaded. Supporting libraries such as soundfile are required.
  • Hardware: Reports a Real-Time Factor (RTF) as low as 0.17 on an NVIDIA RTX 4090 GPU for streaming synthesis; an RTF below 1 means audio is generated faster than it plays back.
  • Demo: A Gradio web demo is available via python app.py.
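
As a rough sketch of what basic usage might look like after installation: everything specific below is an assumption to check against the project's README, since this summary does not specify it. That includes the `VoxCPM.from_pretrained` and `generate` calls, the `prompt_wav_path` and `prompt_text` arguments for voice cloning, the model identifier `openbmb/VoxCPM-0.5B`, and the 16 kHz output rate. The wrapper function `synthesize` is hypothetical.

```python
def synthesize(text, out_path, prompt_wav_path=None, prompt_text=None,
               model_id="openbmb/VoxCPM-0.5B", sample_rate=16000):
    """Generate speech for `text` and write it to `out_path` as a WAV file.

    Hypothetical wrapper. The voxcpm calls, argument names, model id, and
    sample rate are assumptions; verify them against the installed version.
    """
    # Imports are deferred so this file loads even without the packages installed.
    import soundfile as sf
    from voxcpm import VoxCPM

    # Assumed API: loads the model, downloading weights if they are not cached.
    model = VoxCPM.from_pretrained(model_id)

    # Assumed API: passing a reference clip and its transcript requests
    # zero-shot cloning of that voice; leaving both as None uses no reference.
    wav = model.generate(
        text=text,
        prompt_wav_path=prompt_wav_path,
        prompt_text=prompt_text,
    )

    sf.write(out_path, wav, sample_rate)
    return out_path
```

A call such as `synthesize("Hello from VoxCPM.", "output.wav")` would then be expected to produce a WAV file, provided the assumed API matches the installed package.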

Highlighted Details

  • Achieves a Real-Time Factor (RTF) of 0.17 on an NVIDIA RTX 4090.
  • Delivers competitive zero-shot TTS performance on benchmarks like Seed-TTS-eval and CV3-eval, with low error rates and high similarity scores.
  • Features context-aware generation, adapting speaking style to text content.
  • Enables true-to-life zero-shot voice cloning, capturing speaker nuances from short audio references.
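
The real-time factor quoted for the RTX 4090 is the ratio of synthesis time to the duration of the audio produced. The short sketch below works through that arithmetic. The helper names are hypothetical, and the figures are illustrations derived from the reported RTF of 0.17, not new measurements.

```python
def real_time_factor(synthesis_seconds, audio_seconds):
    """RTF = time spent generating / duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def synthesis_time(audio_seconds, rtf):
    """Estimated compute time needed to generate `audio_seconds` of speech."""
    return audio_seconds * rtf


# At RTF 0.17, ten seconds of speech would take roughly 1.7 s to generate,
# and one minute of speech roughly 10.2 s.
ten_second_clip = synthesis_time(10.0, 0.17)
one_minute_clip = synthesis_time(60.0, 0.17)
```

An RTF below 1 is what makes streaming practical: audio is produced faster than it is played back, so playback need not stall waiting for generation.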

Maintenance & Community

  • Developed by ModelBest and THUHCSI.
  • Contact available via WeChat.
  • Future updates planned include a technical report and higher sampling rate support.

Licensing & Compatibility

  • License: Apache-2.0.
  • Compatibility: Intended for research and development. Production/commercial use requires rigorous testing and safety evaluations. Users must avoid misuse for illegal/unethical purposes or infringing rights; AI-generated content should be marked.

Limitations & Caveats

  • May produce unexpected, biased, or artifact-laden outputs.
  • Voice cloning poses a risk of misuse for deepfakes and impersonation.
  • Limited direct control over specific speech attributes like emotion or style.
  • Primarily supports Chinese and English; performance on other languages is not guaranteed.
  • Potential for instability with very long or expressive inputs.

Health Check

  • Last commit: 1 month ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 19
  • Star history: 253 stars in the last 30 days
