ControlSpeech by jishengpeng

Speech synthesis with simultaneous zero-shot speaker cloning and language style control

Created 1 year ago
254 stars

Top 99.1% on SourcePulse

Project Summary

ControlSpeech enables simultaneous zero-shot speaker cloning and language style control in text-to-speech synthesis, targeting researchers and developers in speech technology. It offers fine-grained control over synthesized speech characteristics using a decoupled codec approach.

How It Works

The project leverages a decoupled codec architecture, separating acoustic and linguistic information. This design allows for independent manipulation of speaker identity and language style, facilitating zero-shot adaptation to new speakers and styles without extensive retraining. The system is built upon the VccmDataset and includes evaluation metrics for speed, pitch, energy, emotion, and speaker verification.
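To make the decoupling concrete, here is a loose, hypothetical sketch of the idea: content, speaker timbre, and style live in separate representations, so any one factor can be swapped without touching the others. None of these names come from the ControlSpeech codebase; they are illustrative only.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DecoupledCodes:
    """Hypothetical container for the three decoupled factors."""
    content: tuple  # linguistic/phonetic codes (what is said)
    timbre: str     # speaker-identity embedding, obtainable zero-shot from a prompt
    style: str      # style embedding, e.g. derived from a natural-language description

def clone_with_style(source: DecoupledCodes,
                     timbre_prompt: str,
                     style_prompt: str) -> DecoupledCodes:
    """Keep the content, swap in a new speaker timbre and a new style."""
    return DecoupledCodes(content=source.content,
                          timbre=timbre_prompt,
                          style=style_prompt)

base = DecoupledCodes(content=("HH", "AH", "L", "OW"),
                      timbre="speaker_A", style="neutral")
out = clone_with_style(base, "speaker_B", "happy, fast")
```

Because the factors never share a representation, zero-shot adaptation reduces to substituting one embedding while the others are left intact.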

Quick Start & Requirements

  • Installation: Clone the repository and install dependencies using pip install -r requirements.txt within a conda environment (Python 3.9 recommended).
  • Prerequisites: A CUDA-enabled GPU is required for inference and training. The project also relies on external tools such as MFA (Montreal Forced Aligner) for alignment and on pre-trained models (e.g., emotion2vec, Whisper, WavLM-SV).
  • Resources: Baseline checkpoints need to be downloaded separately.
  • Links: VccmDataset, ControlToolkit
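The installation steps above can be sketched as shell commands; the repository URL and environment name are assumptions inferred from the project and author names, not taken from the README.

```shell
# Assumed repository URL, inferred from the project/author names above.
git clone https://github.com/jishengpeng/ControlSpeech.git
cd ControlSpeech

# Python 3.9 conda environment, as recommended.
conda create -n controlspeech python=3.9 -y
conda activate controlspeech

# Install the pinned dependencies.
pip install -r requirements.txt
```

Baseline checkpoints are not fetched by these commands and must still be downloaded separately.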

Highlighted Details

  • Supports zero-shot speaker cloning and language style control.
  • Utilizes a novel decoupled codec for enhanced control.
  • Provides a comprehensive dataset (VccmDataset) and evaluation metrics.
  • Includes baseline implementations for PromptTTS and PromptStyle.

Maintenance & Community

The project is associated with the ACL 2025 conference and ICASSP 2024. Recent updates include the release of WavChat and the WavTokenizer codec model. Community channels are not explicitly mentioned in the README.

Licensing & Compatibility

The README does not explicitly state a license. Verify the repository's license file before assuming the code or released models are suitable for academic or commercial use.

Limitations & Caveats

The setup requires manually downloading baseline checkpoints and potentially pre-computing alignments with external tools such as MFA, which can be time-consuming. The project is research-oriented; production-readiness and ongoing user support should not be assumed.

Health Check

  • Last Commit: 10 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 5 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering", "Designing Machine Learning Systems"), Junyang Lin (core maintainer at Alibaba Qwen), and 6 more.

OpenVoice by myshell-ai

  • 0.2% · 34k stars
  • Audio foundation model for versatile, instant voice cloning
  • Created 1 year ago · Updated 5 months ago