ControlSpeech by jishengpeng

Speech synthesis with simultaneous zero-shot speaker cloning and language style control

Created 1 year ago
254 stars

Top 99.1% on SourcePulse

Project Summary

ControlSpeech enables simultaneous zero-shot speaker cloning and language style control in text-to-speech synthesis, targeting researchers and developers in speech technology. It offers fine-grained control over synthesized speech characteristics using a decoupled codec approach.

How It Works

The project leverages a decoupled codec architecture, separating acoustic and linguistic information. This design allows for independent manipulation of speaker identity and language style, facilitating zero-shot adaptation to new speakers and styles without extensive retraining. The system is built upon the VccmDataset and includes evaluation metrics for speed, pitch, energy, emotion, and speaker verification.
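To make the decoupling concrete, here is a loose, hypothetical sketch of the idea: content, speaker timbre, and style live in separate representations, so any one factor can be swapped without touching the others. None of these names come from the ControlSpeech codebase; they are illustrative only.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DecoupledCodes:
    """Hypothetical container for the three decoupled factors."""
    content: tuple  # linguistic/phonetic codes (what is said)
    timbre: str     # speaker-identity embedding, obtainable zero-shot from a prompt
    style: str      # style embedding, e.g. derived from a natural-language description

def clone_with_style(source: DecoupledCodes,
                     timbre_prompt: str,
                     style_prompt: str) -> DecoupledCodes:
    """Keep the content, swap in a new speaker timbre and a new style."""
    return DecoupledCodes(content=source.content,
                          timbre=timbre_prompt,
                          style=style_prompt)

base = DecoupledCodes(content=("HH", "AH", "L", "OW"),
                      timbre="speaker_A", style="neutral")
out = clone_with_style(base, "speaker_B", "happy, fast")
```

Because the factors never share a representation, zero-shot adaptation reduces to substituting one embedding while the others are left intact.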

Quick Start & Requirements

  • Installation: Clone the repository and install dependencies using pip install -r requirements.txt within a conda environment (Python 3.9 recommended).
  • Prerequisites: A CUDA-enabled GPU is required for inference and training. The project also relies on external tools such as MFA (Montreal Forced Aligner) for alignment and on pre-trained models (e.g., emotion2vec, Whisper, WavLM-SV).
  • Resources: Baseline checkpoints need to be downloaded separately.
  • Links: VccmDataset, ControlToolkit
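The installation steps above can be sketched as shell commands; the repository URL and environment name are assumptions inferred from the project and author names, not taken from the README.

```shell
# Assumed repository URL, inferred from the project/author names above.
git clone https://github.com/jishengpeng/ControlSpeech.git
cd ControlSpeech

# Python 3.9 conda environment, as recommended.
conda create -n controlspeech python=3.9 -y
conda activate controlspeech

# Install the pinned dependencies.
pip install -r requirements.txt
```

Baseline checkpoints are not fetched by these commands and must still be downloaded separately.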

Highlighted Details

  • Supports zero-shot speaker cloning and language style control.
  • Utilizes a novel decoupled codec for enhanced control.
  • Provides a comprehensive dataset (VccmDataset) and evaluation metrics.
  • Includes baseline implementations for PromptTTS and PromptStyle.

Maintenance & Community

The project is associated with the ACL 2025 conference and ICASSP 2024. Recent updates include the release of WavChat and the WavTokenizer codec model. Community channels are not explicitly mentioned in the README.

Licensing & Compatibility

The README does not explicitly state a license. Verify the repository's license file before assuming the code or released models are suitable for academic or commercial use.

Limitations & Caveats

The setup requires manually downloading baseline checkpoints and potentially pre-computing alignments with external tools such as MFA, which can be time-consuming. The project is research-oriented; production-readiness and ongoing user support should not be assumed.

Health Check

  • Last Commit: 10 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 5 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering", "Designing Machine Learning Systems"), Junyang Lin (core maintainer at Alibaba Qwen), and 6 more.

OpenVoice by myshell-ai

  • 0.2% · 34k stars
  • Audio foundation model for versatile, instant voice cloning
  • Created 1 year ago · Updated 5 months ago