TTS model for human-like, expressive speech
MetaVoice-1B is a foundational text-to-speech (TTS) model designed for generating human-like, expressive speech. It targets researchers and developers seeking high-quality, emotionally nuanced audio synthesis, offering zero-shot voice cloning and fine-tuning capabilities for diverse voice applications.
How It Works
The model predicts EnCodec tokens from text and speaker information, then diffuses these to waveform level. A causal GPT generates the initial EnCodec hierarchies, conditioned on speaker embeddings from a separate verification network. Condition-free sampling enhances cloning. A small, non-causal transformer predicts the remaining hierarchies, enabling parallel generation. Multi-band diffusion creates waveforms, with DeepFilterNet cleaning up artifacts for clearer audio.
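The two token-generation stages described above can be sketched in a few lines. This is a toy illustration only, not MetaVoice's code: `toy_logits` is a hypothetical stand-in for a network forward pass, the codebook size and hierarchy counts are made-up values, and the sketch assumes that condition-free sampling blends speaker-conditioned and unconditioned logits in the style of classifier-free guidance. The diffusion and DeepFilterNet steps are left out.

```python
import random

VOCAB = 8      # toy codebook size (assumption; real codebooks are larger)
N_HIER = 4     # toy number of token hierarchies (assumption)
N_CAUSAL = 2   # toy number of hierarchies produced by the causal stage (assumption)


def guided_logits(cond, uncond, scale):
    """Blend speaker-conditioned and condition-free logits (guidance-style)."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


def toy_logits(key):
    """Hypothetical stand-in for a model forward pass: deterministic pseudo-logits."""
    rng = random.Random(key)  # string seeds are deterministic across runs
    return [rng.uniform(-1.0, 1.0) for _ in range(VOCAB)]


def argmax(xs):
    return max(range(len(xs)), key=xs.__getitem__)


def causal_stage(text, speaker, n_frames, scale=2.0):
    """Stage 1: generate the first hierarchies frame by frame (autoregressive)."""
    frames = []
    for _ in range(n_frames):
        history = str(frames)  # each frame only sees frames generated so far
        frame = []
        for h in range(N_CAUSAL):
            cond = toy_logits(f"{text}|{speaker}|{h}|{history}")
            uncond = toy_logits(f"{text}|<none>|{h}|{history}")
            frame.append(argmax(guided_logits(cond, uncond, scale)))
        frames.append(frame)
    return frames


def parallel_stage(first):
    """Stage 2: fill in the remaining hierarchies for all frames (non-causal)."""
    context = str(first)  # every position sees the whole sequence
    return [
        f + [argmax(toy_logits(f"{context}|{t}|{h}")) for h in range(N_CAUSAL, N_HIER)]
        for t, f in enumerate(first)
    ]


tokens = parallel_stage(causal_stage("hello", "speaker-A", n_frames=5))
print(tokens)  # 5 frames, each holding N_HIER token ids
```

The point of the split is that only the first stage has a sequential dependency between frames; the second stage's per-frame predictions are independent of one another given the first stage's output, so they could run in parallel.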
Quick Start & Requirements
Install (Poetry recommended): poetry install && poetry run pip install torch==2.2.1 torchaudio==2.2.1
Dependencies: ffmpeg, wget, rust.
Setup involves installing ffmpeg, rustup, and poetry.
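Before running the install command, it can help to confirm that the tools named above are available. The following is a minimal sketch using only the Python standard library; the tool list is copied from this section, and the helper name `missing_tools` is hypothetical rather than part of the project.

```python
import shutil

# Tool names taken from the requirements listed above.
REQUIRED_TOOLS = ["ffmpeg", "wget", "rustup", "poetry"]


def missing_tools(tools, which=shutil.which):
    """Return the tools that cannot be found on PATH."""
    return [t for t in tools if which(t) is None]


if __name__ == "__main__":
    missing = missing_tools(REQUIRED_TOOLS)
    if missing:
        print("Missing from PATH:", ", ".join(missing))
    else:
        print("All listed tools were found on PATH.")
```

The `which` parameter exists only so the lookup can be swapped out when testing; in normal use the default `shutil.which` is enough.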
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
Quantization modes (int4, int8) offer faster inference but degrade audio quality.
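To give a feel for why lower bit widths cost fidelity, the sketch below applies generic symmetric uniform quantization to random values and measures the round-trip error at 8 and 4 bits. It is an illustration of the general trade-off, not MetaVoice's quantization code: the random "weights" are a toy stand-in, and the effect on audio quality itself is not measured here.

```python
import random


def quantize_roundtrip(values, bits):
    """Symmetric uniform quantization to `bits` bits, then back to floats."""
    qmax = 2 ** (bits - 1) - 1                  # 127 for 8 bits, 7 for 4 bits
    scale = max(abs(v) for v in values) / qmax  # one scale for the whole list
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def mean_abs_error(values, bits):
    """Average absolute difference between original and restored values."""
    restored = quantize_roundtrip(values, bits)
    return sum(abs(a - b) for a, b in zip(values, restored)) / len(values)


rng = random.Random(0)
weights = [rng.gauss(0.0, 1.0) for _ in range(10_000)]  # toy stand-in for weights

err8 = mean_abs_error(weights, 8)
err4 = mean_abs_error(weights, 4)
print(f"8-bit mean abs error: {err8:.5f}")
print(f"4-bit mean abs error: {err4:.5f}")
```

With far fewer representable levels, the 4-bit round trip leaves a noticeably larger error than the 8-bit one, which is consistent with the quality trade-off noted above.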