MARS5-TTS by Camb-ai

Text-to-speech (TTS) model for generating speech with natural prosody

Created 1 year ago

2,813 stars

Top 16.6% on SourcePulse

1 Expert Loves This Project

Georgios Konstantopoulos

CTO, General Partner at Paradigm

Project Summary

MARS5-TTS is an open-source English text-to-speech model from CAMB.AI, designed for generating speech with highly natural prosody, even in challenging scenarios like sports commentary or anime. It targets researchers and developers needing advanced TTS capabilities, offering a novel two-stage AR-NAR pipeline for high-quality voice cloning and prosody control.

How It Works

MARS5 works in two stages: an autoregressive (AR) transformer generates coarse speech features, and a non-autoregressive (NAR) multinomial diffusion model (DDPM) then refines them. Because the model conditions on the raw input text, punctuation and capitalization can be used to steer prosody. Speaker identity is taken from a reference audio clip of 2-12 seconds; optionally supplying the transcript of that clip enables "deep cloning", which improves cloning quality.
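The reference constraints described above can be checked before any model is loaded. The following is a minimal sketch, not part of MARS5 itself: the names `plan_reference` and `ReferencePlan` are hypothetical, and the only facts taken from the summary are the 2-12 second reference window and the idea that a reference transcript enables deep cloning.

```python
from dataclasses import dataclass
from typing import Optional

# Reference-length window stated in the summary (seconds).
MIN_REF_SECONDS = 2.0
MAX_REF_SECONDS = 12.0


@dataclass
class ReferencePlan:
    """Hypothetical container describing how a reference clip would be used."""
    duration_s: float
    deep_clone: bool


def plan_reference(num_samples: int, sample_rate: int,
                   transcript: Optional[str] = None) -> ReferencePlan:
    """Validate the reference length and decide whether deep cloning applies."""
    if sample_rate <= 0:
        raise ValueError("sample_rate must be positive")
    duration_s = num_samples / sample_rate
    if not (MIN_REF_SECONDS <= duration_s <= MAX_REF_SECONDS):
        raise ValueError(
            f"reference is {duration_s:.2f}s; expected "
            f"{MIN_REF_SECONDS}-{MAX_REF_SECONDS}s"
        )
    # Deep cloning is only possible when a non-empty transcript is supplied.
    deep_clone = bool(transcript and transcript.strip())
    return ReferencePlan(duration_s=duration_s, deep_clone=deep_clone)
```

For example, `plan_reference(144000, 24000, "hello there")` describes a 6-second clip with deep cloning enabled; the 24000 Hz sample rate here is only an illustrative value.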

Quick Start & Requirements

Installation: pip install --upgrade torch torchaudio librosa vocos encodec safetensors regex
Prerequisites: Python >= 3.10, PyTorch >= 2.0.
Usage: Load models via torch.hub.load('Camb-ai/mars5-tts', 'mars5_english', trust_repo=True) or from Hugging Face.
Demo: Colab Quickstart
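The usage line above can be expanded into a rough end-to-end sketch. Only the `torch.hub.load` call comes from this summary. Everything else is an assumption based on the upstream README as it appeared at the time of writing and should be verified against the repository: that the hub entry point returns a `(model, config_class)` pair, that the model exposes `sr` and `tts(...)`, and that the config accepts `deep_clone`. The wrapper name `synthesize` is hypothetical.

```python
from typing import Optional

HUB_REPO = "Camb-ai/mars5-tts"
HUB_ENTRY = "mars5_english"


def synthesize(text: str, ref_wav_path: str, ref_transcript: Optional[str] = None):
    """Sketch of a MARS5 call; downloads the checkpoints on first use."""
    # Imported lazily so this module can be loaded without the heavy dependencies.
    import librosa
    import torch

    # Assumption: the hub entry point returns the model and its config class.
    mars5, config_class = torch.hub.load(HUB_REPO, HUB_ENTRY, trust_repo=True)

    # Load the reference clip as mono audio at the model's sample rate.
    # Assumption: the model exposes its sample rate as `mars5.sr`.
    wav, _ = librosa.load(ref_wav_path, sr=mars5.sr, mono=True)
    wav = torch.from_numpy(wav)

    # Deep cloning needs the transcript of the reference clip.
    deep_clone = bool(ref_transcript and ref_transcript.strip())

    # Assumption: `deep_clone` is a config field; other sampling options
    # are left at whatever defaults the config class defines.
    cfg = config_class(deep_clone=deep_clone)

    # Assumption: `tts` returns (ar_codes, output_audio), and an empty
    # transcript is acceptable when deep cloning is disabled.
    _ar_codes, output_audio = mars5.tts(text, wav, ref_transcript or "", cfg=cfg)
    return output_audio, mars5.sr
```

A call such as `synthesize("The quick brown rat.", "reference.wav", "transcript of the clip")` would return the generated waveform and its sample rate if the assumptions above hold; `reference.wav` is a placeholder path.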

Highlighted Details

Generates high-quality speech from as little as 5 seconds of reference audio.
Enables prosody control through text punctuation and capitalization.
Supports "deep clone" for improved voice cloning quality by providing reference audio transcripts.
Checkpoints available in PyTorch (.pt) and safetensors formats.

Maintenance & Community

The AR checkpoints have been updated since the initial release.
Community channels available via Discord.
Model available on HuggingFace.

Licensing & Compatibility

Licensed under GNU AGPL 3.0 for the English version.
Commercial licensing inquiries should be directed to help@camb.ai.

Limitations & Caveats

The model is primarily English-focused, and long-form generation is not natively supported, requiring chunking strategies. Performance on Apple's MPS backend may be impacted by unsupported operators.
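One simple chunking strategy for long-form input is to split the text at sentence boundaries and pack sentences into chunks under a length budget, synthesizing each chunk separately. The sketch below is an illustration only: `chunk_text` and the 200-character default are hypothetical choices, not values recommended by the MARS5 authors.

```python
import re


def chunk_text(text: str, max_chars: int = 200) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept intact as its own chunk.
    """
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would exceed the budget: close the chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be passed to the model in turn and the resulting audio concatenated. Splitting only at sentence boundaries keeps the punctuation that the model uses as a prosody cue.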

Health Check

Last Commit: 1 year ago
Responsiveness: Inactive

Star History

4 stars in the last 30 days