metavoice-src by metavoiceio

TTS model for human-like, expressive speech

created 1 year ago
4,142 stars

Top 12.1% on sourcepulse

Project Summary

MetaVoice-1B is a foundational text-to-speech (TTS) model designed for generating human-like, expressive speech. It targets researchers and developers seeking high-quality, emotionally nuanced audio synthesis, offering zero-shot voice cloning and fine-tuning capabilities for diverse voice applications.

How It Works

The model predicts EnCodec tokens from text and speaker information, then decodes them to a waveform. A causal GPT generates the first two EnCodec token hierarchies, conditioned on the text and on speaker embeddings produced by a separate verification network; condition-free sampling improves the model's cloning ability. A small non-causal transformer then predicts the remaining hierarchies, enabling parallel generation. Multi-band diffusion converts the full token stack into a waveform, and DeepFilterNet cleans up the background artifacts the diffusion step introduces.
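For orientation, here is a shape-level sketch of that data flow, assuming the standard 24 kHz EnCodec configuration (8 hierarchies, 1024-entry codebooks, roughly 320 samples per frame). All modules are stand-ins and every dimension is an illustrative assumption, not the project's actual code:

    # Shape-level sketch of the MetaVoice pipeline. All tensors below are
    # stand-ins; shapes and sizes are illustrative assumptions.
    import torch

    B, T, N_HIER, CODEBOOK = 1, 256, 8, 1024  # batch, frames, hierarchies, codebook size

    # Inputs: text tokens plus a speaker embedding from a verification network.
    text_tokens = torch.randint(0, 512, (B, 64))
    spk_emb = torch.randn(B, 256)

    # Stage 1: a causal GPT predicts the first two token hierarchies,
    # conditioned on the text and speaker embedding (condition-free
    # sampling is applied here to improve cloning).
    coarse = torch.randint(0, CODEBOOK, (B, 2, T))       # stand-in GPT output

    # Stage 2: a small non-causal transformer fills in the remaining
    # hierarchies in parallel, given the coarse tokens.
    fine = torch.randint(0, CODEBOOK, (B, N_HIER - 2, T))
    tokens = torch.cat([coarse, fine], dim=1)            # (B, 8, T)

    # Stage 3: multi-band diffusion decodes the token stack to a waveform
    # (~320 samples per frame at 24 kHz); DeepFilterNet then removes the
    # background artifacts the diffusion step introduces.
    waveform = torch.randn(B, T * 320)                   # stand-in decoded audio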

Quick Start & Requirements

  • Install: poetry install && poetry run pip install torch==2.2.1 torchaudio==2.2.1 (Poetry recommended); a minimal usage sketch follows this list.
  • Prerequisites: GPU with >=12 GB VRAM; Python >=3.10,<3.12; ffmpeg, wget, and Rust (installed via rustup).
  • Docs: API definitions are available once the server is running.
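A minimal usage sketch, following the Python entry point shown in the upstream README (verify the exact class and method names against the repo before relying on them):

    # Minimal usage sketch based on the upstream README; check the repo
    # for the current interface.
    from fam.llm.fast_inference import TTS

    tts = TTS()  # downloads model weights on first use
    wav_file = tts.synthesise(
        text="This is a demo of MetaVoice-1B.",
        spk_ref_path="assets/bria.mp3",  # reference clip shipped with the repo
    )
    print(wav_file)  # path to the generated .wav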

Highlighted Details

  • 1.2B-parameter model trained on 100K hours of speech.
  • Zero-shot voice cloning with 30s of reference audio (American & British English).
  • Fine-tuning supports cross-lingual cloning with as little as 1 minute of data.
  • Achieves a Real-Time Factor (RTF) below 1.0 on modern GPUs after compilation; see the worked example after this list.
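The RTF claim is simply the ratio of synthesis time to audio duration, as the arithmetic below shows (the numbers are illustrative, not measurements from the project):

    # Real-Time Factor: time spent synthesising divided by the duration of
    # the audio produced. RTF < 1.0 means generation outpaces playback.
    synthesis_seconds = 6.2    # illustrative timing, not a benchmark
    audio_seconds = 10.0       # length of the generated clip
    rtf = synthesis_seconds / audio_seconds
    print(f"RTF = {rtf:.2f}")  # 0.62 -> faster than real time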

Maintenance & Community

  • Supported by Together.ai, AWS, GCP, and Hugging Face.
  • Codebase based on NanoGPT and includes implementations from various researchers.

Licensing & Compatibility

  • Released under the Apache 2.0 license, which permits commercial use (subject to the license's standard attribution and notice requirements).

Limitations & Caveats

  • Synthesis of arbitrary-length text is listed as upcoming.
  • Diffusion at the waveform level can introduce unpleasant background artifacts, though DeepFilterNet mitigates this.
  • Experimental quantization modes (int4, int8) offer faster inference but degrade audio quality; see the sketch below.
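A hypothetical way to select one of those modes at load time; the quantisation_mode parameter name is an assumption and should be checked against the repo:

    # Hypothetical: the `quantisation_mode` parameter name is an assumption;
    # confirm it against fam/llm/fast_inference.py in the repo.
    from fam.llm.fast_inference import TTS

    tts = TTS(quantisation_mode="int4")  # or "int8": faster, lower audio quality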

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 53 stars in the last 90 days

Explore Similar Projects

Starred by Tim J. Baek (Founder of Open WebUI), Chip Huyen (Author of AI Engineering and Designing Machine Learning Systems), and 3 more.

StyleTTS2 by yl4579

Text-to-speech model achieving human-level synthesis
Top 0.2% · 6k stars
created 2 years ago
updated 11 months ago