metavoice-src by metavoiceio

TTS model for human-like, expressive speech

created 1 year ago
4,142 stars

Top 12.1% on sourcepulse

Project Summary

MetaVoice-1B is a foundational text-to-speech (TTS) model designed for generating human-like, expressive speech. It targets researchers and developers seeking high-quality, emotionally nuanced audio synthesis, offering zero-shot voice cloning and fine-tuning capabilities for diverse voice applications.

How It Works

The model predicts EnCodec tokens from text and speaker information, then decodes them to a waveform. A causal GPT generates the first two EnCodec token hierarchies, conditioned on the text and on speaker embeddings produced by a separate verification network; condition-free sampling improves the model's cloning ability. A small non-causal transformer then predicts the remaining hierarchies, enabling parallel generation. Multi-band diffusion converts the full token stack into a waveform, and DeepFilterNet cleans up the background artifacts the diffusion step introduces.
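For orientation, here is a shape-level sketch of that data flow, assuming the standard 24 kHz EnCodec configuration (8 hierarchies, 1024-entry codebooks, roughly 320 samples per frame). All modules are stand-ins and every dimension is an illustrative assumption, not the project's actual code:

    # Shape-level sketch of the MetaVoice pipeline. All tensors below are
    # stand-ins; shapes and sizes are illustrative assumptions.
    import torch

    B, T, N_HIER, CODEBOOK = 1, 256, 8, 1024  # batch, frames, hierarchies, codebook size

    # Inputs: text tokens plus a speaker embedding from a verification network.
    text_tokens = torch.randint(0, 512, (B, 64))
    spk_emb = torch.randn(B, 256)

    # Stage 1: a causal GPT predicts the first two token hierarchies,
    # conditioned on the text and speaker embedding (condition-free
    # sampling is applied here to improve cloning).
    coarse = torch.randint(0, CODEBOOK, (B, 2, T))       # stand-in GPT output

    # Stage 2: a small non-causal transformer fills in the remaining
    # hierarchies in parallel, given the coarse tokens.
    fine = torch.randint(0, CODEBOOK, (B, N_HIER - 2, T))
    tokens = torch.cat([coarse, fine], dim=1)            # (B, 8, T)

    # Stage 3: multi-band diffusion decodes the token stack to a waveform
    # (~320 samples per frame at 24 kHz); DeepFilterNet then removes the
    # background artifacts the diffusion step introduces.
    waveform = torch.randn(B, T * 320)                   # stand-in decoded audio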

Quick Start & Requirements

  • Install: poetry install && poetry run pip install torch==2.2.1 torchaudio==2.2.1 (Poetry recommended); a minimal usage sketch follows this list.
  • Prerequisites: GPU with >=12 GB VRAM; Python >=3.10,<3.12; ffmpeg, wget, and Rust (installed via rustup).
  • Docs: API definitions are available once the server is running.
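A minimal usage sketch, following the Python entry point shown in the upstream README (verify the exact class and method names against the repo before relying on them):

    # Minimal usage sketch based on the upstream README; check the repo
    # for the current interface.
    from fam.llm.fast_inference import TTS

    tts = TTS()  # downloads model weights on first use
    wav_file = tts.synthesise(
        text="This is a demo of MetaVoice-1B.",
        spk_ref_path="assets/bria.mp3",  # reference clip shipped with the repo
    )
    print(wav_file)  # path to the generated .wav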

Highlighted Details

  • 1.2B-parameter model trained on 100K hours of speech.
  • Zero-shot voice cloning with 30s of reference audio (American & British English).
  • Fine-tuning supports cross-lingual cloning with as little as 1 minute of data.
  • Achieves a Real-Time Factor (RTF) below 1.0 on modern GPUs after compilation; see the worked example after this list.
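The RTF claim is simply the ratio of synthesis time to audio duration, as the arithmetic below shows (the numbers are illustrative, not measurements from the project):

    # Real-Time Factor: time spent synthesising divided by the duration of
    # the audio produced. RTF < 1.0 means generation outpaces playback.
    synthesis_seconds = 6.2    # illustrative timing, not a benchmark
    audio_seconds = 10.0       # length of the generated clip
    rtf = synthesis_seconds / audio_seconds
    print(f"RTF = {rtf:.2f}")  # 0.62 -> faster than real time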

Maintenance & Community

  • Supported by Together.ai, AWS, GCP, and Hugging Face.
  • Codebase based on NanoGPT and includes implementations from various researchers.

Licensing & Compatibility

  • Released under the Apache 2.0 license, which permits commercial use (subject to the license's standard attribution and notice requirements).

Limitations & Caveats

  • Synthesis of arbitrary-length text is listed as upcoming.
  • Diffusion at the waveform level can introduce unpleasant background artifacts, though DeepFilterNet mitigates this.
  • Experimental quantization modes (int4, int8) offer faster inference but degrade audio quality; see the sketch below.
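A hypothetical way to select one of those modes at load time; the quantisation_mode parameter name is an assumption and should be checked against the repo:

    # Hypothetical: the `quantisation_mode` parameter name is an assumption;
    # confirm it against fam/llm/fast_inference.py in the repo.
    from fam.llm.fast_inference import TTS

    tts = TTS(quantisation_mode="int4")  # or "int8": faster, lower audio quality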

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 53 stars in the last 90 days

Explore Similar Projects

Starred by Tim J. Baek (Founder of Open WebUI), Chip Huyen (Author of AI Engineering and Designing Machine Learning Systems), and 3 more.

StyleTTS2 by yl4579

Text-to-speech model achieving human-level synthesis
Top 0.2% · 6k stars
created 2 years ago
updated 11 months ago