MARS5-TTS by Camb-ai

Speech model (TTS) for prosody generation

created 1 year ago
2,785 stars

Top 17.5% on sourcepulse

Project Summary

MARS5-TTS is an open-source English text-to-speech model from CAMB.AI, designed to generate speech with natural prosody even in challenging scenarios such as sports commentary or anime. It targets researchers and developers who need advanced TTS capabilities, and it uses a two-stage autoregressive/non-autoregressive (AR-NAR) pipeline for high-quality voice cloning and prosody control.

How It Works

MARS5 uses a two-stage pipeline: an autoregressive (AR) transformer generates coarse speech features, and a non-autoregressive (NAR) multinomial diffusion model (DDPM) then refines them. Prosody can be steered through punctuation and capitalization in the input text. Speaker identity is captured from a 2-12 second reference audio clip; optionally supplying the transcript of that clip enables "deep cloning" for higher quality.
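The 2-12 second reference window mentioned above can be checked before calling the model. The helper below is a minimal sketch, not part of MARS5: the function name and its defaults are hypothetical, and the 24 kHz rate in the usage note is an assumption about the model's sample rate.

```python
def reference_duration_ok(num_samples: int, sample_rate: int,
                          min_s: float = 2.0, max_s: float = 12.0) -> bool:
    """Return True if a reference clip falls inside the 2-12 s window.

    Hypothetical helper for illustration; the bounds come from the
    summary above, not from any MARS5 API.
    """
    if sample_rate <= 0:
        raise ValueError("sample_rate must be positive")
    duration_s = num_samples / sample_rate
    return min_s <= duration_s <= max_s
```

For example, a 6 second clip at an assumed 24 kHz has 144,000 samples and passes the check, while a 1 second clip does not.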

Quick Start & Requirements

  • Installation: pip install --upgrade torch torchaudio librosa vocos encodec safetensors regex
  • Prerequisites: Python >= 3.10, PyTorch >= 2.0.
  • Usage: Load models via torch.hub.load('Camb-ai/mars5-tts', 'mars5_english', trust_repo=True) or from Hugging Face.
  • Demo: Colab Quickstart
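The steps above can be sketched end to end as follows. Only the torch.hub.load call is taken from this summary. The rest (the (model, config class) return pair, the mars5.sr attribute, the deep_clone config field, and the tts(...) signature and return values) is recalled from the project's README and should be treated as an assumption to verify against the current repository. The wrapper function synthesize is hypothetical, and its imports are deferred so that defining it does not require the heavy dependencies.

```python
def synthesize(text: str, ref_wav_path: str, ref_transcript: str = ""):
    """Clone the voice in ref_wav_path and speak `text` (illustrative sketch).

    Downloads the model through torch.hub on first use, so it needs
    network access and the packages from the installation step.
    """
    import torch
    import librosa

    # Assumption: the hub entry point returns (model, inference config class).
    mars5, config_class = torch.hub.load(
        'Camb-ai/mars5-tts', 'mars5_english', trust_repo=True
    )

    # Assumption: the model exposes its expected sample rate as `mars5.sr`.
    wav, _ = librosa.load(ref_wav_path, sr=mars5.sr, mono=True)
    wav = torch.from_numpy(wav)

    # Deep clone only makes sense when a reference transcript is supplied.
    deep_clone = bool(ref_transcript)
    cfg = config_class(deep_clone=deep_clone)  # assumed config field name

    # Assumption: tts returns (AR codes, output waveform tensor).
    ar_codes, output_audio = mars5.tts(
        text, wav, ref_transcript if deep_clone else None, cfg=cfg
    )
    return output_audio, mars5.sr
```

A call such as synthesize("The quick brown rat.", "reference.wav", "transcript of the reference") would request a deep clone; omitting the transcript falls back to the shallow path.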

Highlighted Details

  • Generates high-quality speech from as little as 5 seconds of audio reference.
  • Enables prosody control through text punctuation and capitalization.
  • Supports "deep clone" for improved voice cloning quality by providing reference audio transcripts.
  • Checkpoints available in PyTorch (.pt) and safetensors formats.

Maintenance & Community

  • Active development with recent updates to AR checkpoints.
  • Community channels available via Discord.
  • Model available on HuggingFace.

Licensing & Compatibility

  • Licensed under GNU AGPL 3.0 for the English version.
  • Commercial licensing inquiries should be directed to help@camb.ai.

Limitations & Caveats

The model is primarily English-focused, and long-form generation is not natively supported, requiring chunking strategies. Performance on Apple's MPS backend may be impacted by unsupported operators.
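One possible chunking strategy for long-form text is to split on sentence boundaries and pack sentences into chunks under a length budget, then synthesize each chunk separately. The sketch below is an illustration only: the function name and the 200-character default are hypothetical choices, not values recommended by the project.

```python
import re


def chunk_text(text: str, max_chars: int = 200) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_chars.

    A single sentence longer than max_chars is kept intact as its own
    chunk rather than being cut mid-sentence.
    """
    # Split after sentence-ending punctuation that is followed by whitespace.
    sentences = [s.strip()
                 for s in re.split(r'(?<=[.!?])\s+', text.strip())
                 if s.strip()]

    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would exceed the budget: close the chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk can then be passed to the model in turn and the resulting audio segments concatenated. Splitting on sentence boundaries keeps punctuation intact, which matters here because punctuation steers prosody.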

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 week
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 67 stars in the last 90 days
