metavoice-src by metavoiceio

TTS model for human-like, expressive speech

Created 1 year ago
4,194 stars

Top 11.8% on SourcePulse

View on GitHub
Project Summary

MetaVoice-1B is a foundational text-to-speech (TTS) model designed for generating human-like, expressive speech. It targets researchers and developers seeking high-quality, emotionally nuanced audio synthesis, offering zero-shot voice cloning and fine-tuning capabilities for diverse voice applications.

How It Works

The model predicts EnCodec tokens from text and speaker information, then diffuses them up to the waveform level. A causal GPT autoregressively generates the initial EnCodec hierarchies, conditioned on speaker embeddings from a separate speaker-verification network; condition-free sampling is used to strengthen cloning. A small, non-causal transformer then predicts the remaining hierarchies for all frames at once, enabling parallel generation. Finally, multi-band diffusion produces the waveform, and DeepFilterNet cleans up the background artifacts that diffusion can introduce.
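The data flow above can be sketched as a three-stage pipeline. The functions below are toy stand-ins, not the real model components (which are a 1.2B-parameter GPT, a small non-causal transformer, and a multi-band diffusion model); they only illustrate how coarse tokens, fine tokens, and the waveform depend on one another:

```python
# Illustrative sketch of the MetaVoice-1B inference pipeline.
# Every function here is a toy stand-in for a learned model.

def causal_stage(text_tokens, speaker_embedding):
    """Stand-in for the causal GPT: emits the initial EnCodec
    hierarchies one frame at a time (autoregressive)."""
    coarse = []
    for i, tok in enumerate(text_tokens):
        # Toy conditioning on the text token and speaker embedding.
        coarse.append((tok + int(speaker_embedding[i % len(speaker_embedding)])) % 1024)
    return coarse

def parallel_stage(coarse_tokens):
    """Stand-in for the small non-causal transformer: predicts the
    remaining hierarchies for every frame at once (parallel)."""
    return [[(c + k) % 1024 for k in range(1, 7)] for c in coarse_tokens]

def diffuse_to_waveform(coarse, fine, frame_size=4):
    """Stand-in for multi-band diffusion + DeepFilterNet cleanup:
    maps token frames to a (fake) waveform with samples in [-1, 1]."""
    wave = []
    for c, f in zip(coarse, fine):
        code = (c + sum(f)) % 2048
        for _ in range(frame_size):
            wave.append((code % 256) / 128.0 - 1.0)
    return wave

# Toy end-to-end run.
text_tokens = [17, 93, 402, 7]
speaker_embedding = [0.2, 1.7, 3.1]   # would come from a verification network
coarse = causal_stage(text_tokens, speaker_embedding)
fine = parallel_stage(coarse)
waveform = diffuse_to_waveform(coarse, fine)
```

The split between the autoregressive first stage and the parallel second stage is what keeps generation fast: only the coarsest hierarchies pay the sequential cost.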

Quick Start & Requirements

  • Install: poetry install && poetry run pip install torch==2.2.1 torchaudio==2.2.1 (Poetry recommended)
  • Prerequisites: GPU with >=12 GB VRAM; Python >=3.10,<3.12; ffmpeg; wget; Rust.
  • Setup: install ffmpeg, rustup, and Poetry before installing the project.
  • Docs: API definitions are provided (assuming the server is running).

Highlighted Details

  • 1.2B parameter model trained on 100K hours of speech.
  • Zero-shot voice cloning with 30s reference audio (American & British English).
  • Fine-tuning supports cross-lingual cloning with as little as 1 minute of data.
  • Achieves Real-Time Factor (RTF) < 1.0 on modern GPUs after compilation.
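The RTF claim can be made concrete: the Real-Time Factor is wall-clock synthesis time divided by the duration of the audio produced, so RTF < 1.0 means faster-than-real-time generation. A minimal helper (not from the repo):

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Wall-clock time spent generating, divided by audio duration."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return synthesis_seconds / audio_seconds

# e.g. 6.2 s of compute to synthesize a 10 s clip:
rtf = real_time_factor(6.2, 10.0)
print(f"RTF = {rtf:.2f}")   # RTF = 0.62 -> faster than real time
```

An RTF of 0.62 means each second of audio takes 0.62 s to generate, so a streaming server can keep up with playback.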

Maintenance & Community

  • Supported by Together.ai, AWS, GCP, and Hugging Face.
  • Codebase is based on nanoGPT and incorporates implementations from various researchers.

Licensing & Compatibility

  • Released under the Apache 2.0 license, which permits commercial use.

Limitations & Caveats

  • Synthesis of arbitrary-length text is listed as upcoming.
  • Diffusion at the waveform level can introduce unpleasant background artifacts, though DeepFilterNet mitigates this.
  • Experimental quantization modes (int4, int8) offer faster inference but degrade audio quality.
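To illustrate why int8 quantization trades audio quality for speed: mapping weights onto 256 levels introduces rounding error that accumulates through the network. A toy symmetric per-tensor int8 quantizer (illustrative only, not the repo's implementation):

```python
def quantize_int8(values):
    """Symmetric per-tensor int8 quantization: map floats to [-127, 127]."""
    scale = max(abs(v) for v in values) / 127.0
    q = [round(v / scale) for v in values]
    return q, scale

def dequantize(q, scale):
    """Map the integer codes back to approximate floats."""
    return [x * scale for x in q]

weights = [0.5, -1.25, 0.031, 0.9, -0.002]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
# Reconstruction error is bounded by half a quantization step (scale / 2).
max_err = max(abs(w - r) for w, r in zip(weights, restored))
```

int4 halves the bit budget again (16 levels per sign), which is why its quality degradation is more audible than int8's.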
Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 35 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), Junyang Lin (core maintainer at Alibaba Qwen), and 6 more.

OpenVoice by myshell-ai

Top 0.4%, 35k stars
Audio foundation model for versatile, instant voice cloning
Created 1 year ago; updated 6 months ago
Starred by Georgios Konstantopoulos (CTO, General Partner at Paradigm) and Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems").

GPT-SoVITS by RVC-Boss

Top 0.3%, 52k stars
Few-shot voice cloning and TTS web UI
Created 1 year ago; updated 1 month ago