MisoTTS by MisoLabsAI

Emotive, 8B parameter text-to-speech model

Created 3 weeks ago

2,694 stars

Top 17.1% on SourcePulse

Project Summary

Miso TTS 8B is a state-of-the-art, emotive text-to-speech model designed for generating high-quality, conversational speech. It targets developers and researchers needing advanced TTS capabilities, offering natural voice synthesis and voice cloning features.

How It Works

The model uses an RVQ (residual vector quantization) Transformer architecture. A large Llama 3.2-style 8B parameter backbone processes text and audio embeddings, and a smaller 300M parameter autoregressive audio decoder predicts audio codes. This design allows conditioning on optional audio context for voice cloning and aims for emotive, high-fidelity speech generation.
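
To make the term "audio codes" more concrete, here is a minimal sketch of generic residual vector quantization, the technique behind the RVQ name. It is not MisoTTS or Mimi code: the function names, the two-stage layout, and the toy 2-D codebook values are all assumptions made for illustration. Each stage picks the nearest codebook entry for whatever the previous stages left unexplained, so one frame becomes a short list of integer codes.

```python
from typing import List, Sequence

Vector = List[float]


def nearest(codebook: Sequence[Vector], x: Vector) -> int:
    """Index of the codebook entry closest to x (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(codebook[i], x)),
    )


def rvq_encode(x: Vector, codebooks: Sequence[Sequence[Vector]]) -> List[int]:
    """Quantize x stage by stage; each stage encodes the previous residual."""
    residual = list(x)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        # Subtract the chosen entry so the next stage sees only the leftover error.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes


def rvq_decode(codes: Sequence[int], codebooks: Sequence[Sequence[Vector]]) -> Vector:
    """Reconstruct a vector by summing the selected entry from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Toy codebooks with made-up values: a coarse stage followed by a fine stage.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]],
    [[0.0, 0.0], [0.25, -0.25], [-0.25, 0.25]],
]
codes = rvq_encode([1.2, 0.8], codebooks)
print(codes, rvq_decode(codes, codebooks))  # [1, 1] [1.25, 0.75]
```

In a real codec the codebooks are learned and far larger. The sketch only shows why a model of this kind predicts several code indices per audio frame rather than a single token.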

Quick Start & Requirements

Installation can be managed via uv or standard pip. Clone the repository, then set up the environment with uv sync --python 3.10 or pip install -e . and run the model with uv run python run_misotts.py or python run_misotts.py. Key requirements:

  • Python 3.10
  • A high-VRAM GPU (24GB+ recommended for bfloat16, 40GB+ for float32)
  • ~20-40GB RAM for CPU inference
  • ~30-40GB disk space for initial model and dependency downloads

A demo is available at misolabs.ai.
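
The GPU memory recommendations can be sanity-checked with a back-of-the-envelope calculation. The sketch below assumes the 8.2 billion parameter total listed under Highlighted Details and the standard widths of 2 bytes per bfloat16 value and 4 bytes per float32 value. It counts weights only, so activations, caches, and the audio tokenizer are ignored. The helper name weight_memory_gb is hypothetical.

```python
# Rough weight-only memory estimate; real usage is higher because of
# activations, caches, and auxiliary models.
PARAMS = 8.2e9  # total parameter count quoted in the project summary

BYTES_PER_PARAM = {"bfloat16": 2, "float32": 4}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Memory needed for the weights alone, in decimal gigabytes."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: ~{weight_memory_gb(PARAMS, dtype):.1f} GB for weights")
# bfloat16: ~16.4 GB for weights
# float32: ~32.8 GB for weights
```

Under these assumptions the weights alone take roughly 16 GB in bfloat16 and 33 GB in float32, which fits the 24GB+ and 40GB+ recommendations once runtime overhead is added.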

Highlighted Details

  • Features an 8.2 billion parameter model comprising an 8B backbone and a 300M audio decoder.
  • Employs an RVQ Transformer architecture with a Mimi audio tokenizer.
  • Supports prompted generation for voice cloning capabilities.
  • Generated audio is watermarked by default using the SilentCipher model.

Maintenance & Community

Project information and resources are available via the website (misolabs.ai), Hugging Face (MisoLabs/MisoTTS), GitHub (MisoLabsAI/MisoTTS), and X (@MisoLabsAI). No specific community channels like Discord or Slack are listed.

Licensing & Compatibility

The provided README does not explicitly state the software license. Confirm the license with the maintainers before relying on the model for commercial use or derivative works.

Limitations & Caveats

The model currently supports English synthesis only. Its GPU VRAM requirements (24GB+ recommended for bfloat16) make it unsuitable for low-resource hardware. Initial model and dependency downloads are substantial (~30-40GB), and CPU inference is notably slow.

Health Check

  • Last commit: 3 days ago
  • Responsiveness: Inactive
  • Pull requests (30d): 13
  • Issues (30d): 15
  • Star history: 2,698 stars in the last 23 days
