MOSS-TTS  by OpenMOSS

Open-source speech and sound generation model family

Created 2 weeks ago

668 stars

Top 50.4% on SourcePulse

Summary

The MOSS-TTS Family is an open-source suite of models for high-fidelity, highly expressive audio generation. Instead of stretching a single model across every scenario, it offers separate components for long-form speech, dialogue, voice design, and real-time streaming. It targets engineers and researchers who need production-ready building blocks for audio content creation.

How It Works

The MOSS-TTS Family comprises five specialized models (MOSS-TTS, MOSS-TTSD, MOSS-VoiceGenerator, MOSS-TTS-Realtime, MOSS-SoundEffect) that can be used independently or composed into a pipeline. They share a core MOSS-Audio-Tokenizer, built on a "CNN-free" Causal Transformer, which unifies audio representation by compressing 24kHz audio to a 12.5Hz frame rate with high fidelity and native streaming support. This shared representation enables capabilities such as reference-free voice design, along with specialized solutions for long-form speech, expressive dialogue, and low-latency agents.
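
To make the compression figures concrete, the arithmetic can be sketched as follows. This assumes "12.5Hz" means 12.5 tokenizer frames per second of audio, which is a reading of the summary rather than something it states outright; the function names are illustrative and not part of the MOSS-TTS codebase.

```python
import math

SAMPLE_RATE_HZ = 24_000  # input sample rate quoted in the summary
FRAME_RATE_HZ = 12.5     # assumed to be tokenizer frames per second


def samples_per_frame(sample_rate_hz=SAMPLE_RATE_HZ, frame_rate_hz=FRAME_RATE_HZ):
    """Number of raw audio samples summarized by one tokenizer frame."""
    return sample_rate_hz / frame_rate_hz


def frames_for_duration(seconds, frame_rate_hz=FRAME_RATE_HZ):
    """Frames needed to cover `seconds` of audio, rounding up partial frames."""
    return math.ceil(seconds * frame_rate_hz)


print(samples_per_frame())      # 1920.0 samples per frame
print(frames_for_duration(60))  # 750 frames for one minute of audio
```

Under that assumption, each frame stands in for 1920 samples (80 ms of audio), so a minute of speech becomes a sequence of only 750 frames, which helps explain why long-form generation is tractable.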

Quick Start & Requirements

  • Installation: Clone the repo, cd MOSS-TTS, then run pip install --extra-index-url https://download.pytorch.org/whl/cu128 -e . (the trailing dot installs the current directory in editable mode).
  • Prerequisites: Python 3.12, CUDA >= 12.8, PyTorch 2.9.1+cu128, Torchaudio 2.9.1+cu128, Transformers 5.0.0. FlashAttention 2 optional.
  • Links: Hugging Face Spaces for MOSS-TTS, MOSS-TTSD-v1.0, and MOSS-VoiceGenerator.
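
Because the prerequisites pin specific versions, a quick check before installing can save a failed build. The sketch below is a hypothetical helper, not part of the repository. It assumes the PyPI distribution names torch, torchaudio, and transformers, treats "Python 3.12" as an exact minor-version match, and does not attempt to verify the CUDA toolkit or the optional FlashAttention 2 install.

```python
import sys
from importlib import metadata

# Version pins copied from the prerequisites list above.
# The distribution names are assumptions about how the packages are published.
REQUIRED = {"torch": "2.9.1", "torchaudio": "2.9.1", "transformers": "5.0.0"}


def installed_version(dist):
    """Return the installed version string for `dist`, or None if absent."""
    try:
        return metadata.version(dist)
    except metadata.PackageNotFoundError:
        return None


def check_environment(lookup=installed_version, python_version=sys.version_info[:2]):
    """Return a list of human-readable mismatches (empty if all pins match)."""
    problems = []
    if tuple(python_version) != (3, 12):
        found_py = ".".join(str(part) for part in python_version)
        problems.append(f"Python 3.12 expected, found {found_py}")
    for dist, wanted in REQUIRED.items():
        found = lookup(dist)
        if found is None:
            problems.append(f"{dist} is not installed (expected {wanted})")
        elif found.split("+")[0] != wanted:
            # Strip local build tags such as "+cu128" before comparing.
            problems.append(f"{dist} {found} found, expected {wanted}")
    return problems


if __name__ == "__main__":
    for line in check_environment() or ["All pinned prerequisites match."]:
        print(line)
```

The lookup function is injectable so the comparison logic can be exercised without the packages installed.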

Highlighted Details

  • MOSS-TTS: Reported state-of-the-art results on the Seed-TTS-eval benchmark, rivaling closed-source systems, with support for long-form speech and fine-grained control.
  • MOSS-TTSD-v1.0: Reported to lead on both objective and subjective metrics for expressive, multi-speaker dialogue, outperforming Doubao and Gemini 2.5-pro.
  • MOSS-VoiceGenerator: Designed for voice design, generating diverse voices and styles from text prompts without reference speech.
  • MOSS-Audio-Tokenizer: Compresses 24kHz audio to a 12.5Hz frame rate with high fidelity at 0.125-4kbps, with a native streaming design.
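
The bitrate range quoted for the tokenizer can be turned into a per-frame bit budget with a short calculation. This is a sketch under two assumptions: that 1 kbps means 1000 bits per second, and that 12.5Hz is the token frame rate. The helper name is illustrative.

```python
FRAME_RATE_HZ = 12.5  # assumed tokenizer frames per second


def bits_per_frame(bitrate_kbps, frame_rate_hz=FRAME_RATE_HZ):
    """Bits available to encode one frame at the given bitrate."""
    return bitrate_kbps * 1000 / frame_rate_hz


print(bits_per_frame(0.125))  # 10.0 bits per frame at the low end
print(bits_per_frame(4))      # 320.0 bits per frame at the high end
```

One reading consistent with these numbers, though not stated in the summary, is a variable number of 10-bit codes per frame (one at the low end, 32 at the high end), each indexing a 1024-entry codebook.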

Maintenance & Community

The project was released recently (Feb 2026), and its README does not list contributors or community channels. Such information may be available through the linked Hugging Face Spaces or the GitHub repository.

Licensing & Compatibility

Licensed under Apache License 2.0, permitting commercial use and integration into closed-source projects.

Limitations & Caveats

The optional FlashAttention 2 installation may fail on some hardware. Because the project is new, its long-term maintenance and community adoption remain to be seen. Each model in the family has a different architecture and targets a different use case, so choosing among them means weighing their trade-offs.

Health Check

  • Last commit: 21 hours ago
  • Responsiveness: Inactive
  • Pull requests (30d): 17
  • Issues (30d): 24
  • Star history: 673 stars in the last 18 days