MisoTTS by MisoLabsAI

Emotive, 8B parameter text-to-speech model

Created 2 months ago

3,150 stars

Top 14.7% on SourcePulse

2 Experts Love This Project

Summary

Miso TTS 8B is a state-of-the-art, emotive text-to-speech model designed for generating high-quality, conversational speech. It targets developers and researchers needing advanced TTS capabilities, offering natural voice synthesis and voice cloning features.

How It Works

The model uses an RVQ Transformer architecture. A large Llama 3.2-style backbone with 8B parameters processes text and audio embeddings, and a smaller autoregressive audio decoder with 300M parameters predicts audio codes. This design allows conditioning on optional audio context for voice cloning and aims for emotive, high-fidelity speech generation.
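To make the term "audio codes" more concrete, here is a minimal sketch of residual vector quantization (RVQ) in general, written in plain Python. It is not MisoTTS or Mimi code: the codebooks are random, and the names `make_codebooks`, `rvq_encode`, and `rvq_decode`, along with the stage count, codebook size, and vector dimension, are hypothetical choices for illustration only. The idea it shows is that each stage quantizes whatever residual the previous stage left behind, so one audio frame becomes a short list of integer codes.

```python
import random


def make_codebooks(num_stages=4, size=16, dim=8, seed=0):
    """Build random toy codebooks (hypothetical; real codecs learn these)."""
    rng = random.Random(seed)
    # Later stages get smaller-scale entries because residuals tend to shrink.
    return [
        [[rng.gauss(0.0, 1.0 / (2 ** stage)) for _ in range(dim)]
         for _ in range(size)]
        for stage in range(num_stages)
    ]


def nearest(codebook, vec):
    """Return the index of the codebook entry closest to vec (squared L2)."""
    best_idx, best_dist = 0, float("inf")
    for idx, entry in enumerate(codebook):
        dist = sum((a - b) ** 2 for a, b in zip(entry, vec))
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx


def rvq_encode(vec, codebooks):
    """Encode vec as one code per stage; each stage quantizes the residual."""
    residual = list(vec)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        # Subtract the chosen entry so the next stage sees what is left over.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes


def rvq_decode(codes, codebooks):
    """Reconstruct a vector by summing the selected entry from each stage."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


if __name__ == "__main__":
    books = make_codebooks()
    frame = [0.5, -1.0, 0.25, 0.0, 1.5, -0.75, 0.1, 0.9]  # toy "audio frame"
    codes = rvq_encode(frame, books)
    print("codes per frame:", codes)
    print("reconstruction:", [round(x, 3) for x in rvq_decode(codes, books)])
```

In an RVQ-based TTS model, a decoder predicts such per-frame code lists and a codec turns them back into a waveform. How MisoTTS divides that work between its backbone and audio decoder is not detailed in this summary.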

Quick Start & Requirements

Install with uv or standard pip: clone the repository, then run uv sync --python 3.10 or pip install -e . to set up the environment. Run the model with uv run python run_misotts.py or python run_misotts.py. Requirements are Python 3.10; a high-VRAM GPU (24GB+ recommended for bfloat16, 40GB+ for float32); roughly 20-40GB of RAM for CPU inference; and roughly 30-40GB of disk space for the initial model and dependency downloads. A demo is available at misolabs.ai.
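The VRAM recommendations can be sanity-checked with back-of-the-envelope arithmetic. The sketch below assumes the 8.2 billion total parameter count listed under Highlighted Details and the standard sizes of 2 bytes per bfloat16 value and 4 bytes per float32 value. The helper name `weight_memory_gb` is hypothetical, and the estimate covers weights only, not activations, caches, or the watermarking model.

```python
# Bytes needed to store one parameter in each dtype.
BYTES_PER_PARAM = {"bfloat16": 2, "float32": 4}

# Total parameter count as stated in the summary (backbone plus decoder).
TOTAL_PARAMS = 8.2e9


def weight_memory_gb(num_params, dtype):
    """Estimate memory for model weights alone, in decimal gigabytes."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    for dtype in ("bfloat16", "float32"):
        gb = weight_memory_gb(TOTAL_PARAMS, dtype)
        print(f"{dtype}: ~{gb:.1f} GB for weights alone")
    # bfloat16: ~16.4 GB for weights alone
    # float32: ~32.8 GB for weights alone
```

Both estimates sit below the recommended 24GB+ and 40GB+ figures, which is consistent with those figures leaving headroom for runtime overhead.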

Highlighted Details

Features an 8.2 billion parameter model comprising an 8B backbone and a 300M audio decoder.
Employs an RVQ Transformer architecture with a Mimi audio tokenizer.
Supports prompted generation for voice cloning capabilities.
Generated audio is watermarked by default using the SilentCipher model.

Maintenance & Community

Project information and resources are available via the website (misolabs.ai), Hugging Face (MisoLabs/MisoTTS), GitHub (MisoLabsAI/MisoTTS), and X (@MisoLabsAI). No specific community channels like Discord or Slack are listed.

Licensing & Compatibility

The provided README does not explicitly state the software license. This omission requires clarification for assessing commercial use or derivative works.

Limitations & Caveats

The model currently supports English synthesis only. Its GPU memory requirements (24GB+ of VRAM recommended) make it unsuitable for low-resource hardware. The initial model and dependency downloads are substantial (~30-40GB), and CPU inference is notably slow.

Health Check

Last Commit

1 month ago

Responsiveness

Inactive

Star History

138 stars in the last 30 days