Speech model (TTS) for prosody generation
MARS5-TTS is an open-source English text-to-speech model from CAMB.AI, designed to generate speech with natural prosody even in challenging scenarios such as sports commentary or anime. It targets researchers and developers who need advanced TTS capabilities, and it uses a two-stage autoregressive/non-autoregressive (AR-NAR) pipeline for high-quality voice cloning and prosody control.
How It Works
MARS5 employs a two-stage approach: an autoregressive transformer generates coarse speech features, followed by a multinomial diffusion model (DDPM) that refines these features. This pipeline allows for fine-grained control over prosody via punctuation and capitalization in the input text. Speaker identity is captured using 2-12 second audio references, with optional reference transcripts enabling "deep cloning" for enhanced quality.
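The 2-12 second reference window can be checked before a clip is passed to the model. The following is a minimal sketch and not part of MARS5 itself: the helper name check_reference_length and its min_s/max_s parameters are hypothetical, and it works on a raw sample count and sample rate so that it needs no audio library.

```python
def check_reference_length(num_samples: int, sample_rate: int,
                           min_s: float = 2.0, max_s: float = 12.0) -> float:
    """Return the clip duration in seconds, or raise if it is outside [min_s, max_s].

    Hypothetical helper: the 2-12 s defaults mirror the reference length
    quoted in the text above; they are not read from the model.
    """
    if sample_rate <= 0:
        raise ValueError("sample_rate must be positive")
    duration = num_samples / sample_rate
    if not (min_s <= duration <= max_s):
        raise ValueError(
            f"reference clip is {duration:.2f}s; expected {min_s}-{max_s}s"
        )
    return duration
```

With a loaded waveform, the sample count is simply its length, so a call would look like check_reference_length(len(wav), sample_rate).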
Quick Start & Requirements
Install dependencies: pip install --upgrade torch torchaudio librosa vocos encodec safetensors regex
Load the model via torch.hub: torch.hub.load('Camb-ai/mars5-tts', 'mars5_english', trust_repo=True), or from Hugging Face.
Highlighted Details
Checkpoints are available in PyTorch (.pt) and safetensors formats.
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The model is primarily English-focused, and long-form generation is not natively supported, requiring chunking strategies. Performance on Apple's MPS backend may be impacted by unsupported operators.
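One possible chunking strategy for long-form input is to split the text at sentence boundaries and synthesize each chunk separately. The sketch below illustrates that idea under stated assumptions: the function name chunk_text and the 200-character default are hypothetical and do not come from MARS5. Punctuation is kept intact within each chunk, since the model uses it to guide prosody.

```python
import re


def chunk_text(text: str, max_chars: int = 200) -> list[str]:
    """Group whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept whole rather than cut
    mid-sentence. Hypothetical helper; the limit is an arbitrary default.
    """
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow, so close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk could then be synthesized on its own and the resulting audio segments concatenated. Whether the joins sound seamless depends on the model and the reference clip, so this is a starting point rather than a guaranteed fix.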