MOSS-TTSD by OpenMOSS

Expressive Chinese-English spoken dialogue synthesis

Created 8 months ago

1,179 stars

Top 32.8% on SourcePulse

Project Summary

MOSS-TTSD is an open-source, bilingual (Chinese/English) text-to-speech model for expressive spoken dialogue generation. It supports zero-shot two-speaker voice cloning and long-form speech synthesis, which makes it suitable for applications such as AI podcast production.

How It Works

MOSS-TTSD combines a unified semantic-acoustic neural audio codec, a pre-trained large language model, and a large TTS training corpus (millions of hours, including conversational speech). Together these produce expressive, human-like dialogue with natural prosody, switch speakers accurately according to the dialogue script, and support zero-shot voice cloning.
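To make "speaker switching based on dialogue scripts" concrete, here is a minimal sketch of how a tagged script could be split into speaker turns before synthesis. The [S1]/[S2] tag syntax and the split_turns helper are assumptions made for illustration; they are not taken from the project's code or documentation.

```python
import re

# Assumption: speaker turns are marked with tags such as [S1] and [S2].
# This tag syntax is illustrative only, not confirmed by the source.
TAG = re.compile(r"\[S(\d+)\]")


def split_turns(script: str) -> list[tuple[int, str]]:
    """Split a tagged script into (speaker_id, utterance) pairs.

    Text that appears before the first tag is ignored.
    """
    parts = TAG.split(script)
    # With one capture group, re.split returns:
    # [prefix, id, text, id, text, ...]
    turns = []
    for speaker, text in zip(parts[1::2], parts[2::2]):
        text = text.strip()
        if text:  # skip empty turns such as "[S1][S2]hello"
            turns.append((int(speaker), text))
    return turns


if __name__ == "__main__":
    demo = "[S1]Welcome to the show. [S2]Thanks for having me! [S1]Let's begin."
    for speaker, text in split_turns(demo):
        print(f"speaker {speaker}: {text}")
```

A script in this form carries the speaker identity inline, so a single text sequence can describe the whole conversation, including where the voice should change.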

Quick Start & Requirements

Installation: Create and activate a conda environment (conda create -n moss_ttsd python=3.10 -y && conda activate moss_ttsd), then run pip install -r requirements.txt and pip install flash-attn. Download the XY Tokenizer weights from the XY Tokenizer repository.
Prerequisites: Python 3.10, flash-attn.
Usage: Run local inference with python inference.py --jsonl examples/examples.jsonl --output_dir outputs --seed 42 --use_normalize.
Demos: Available on Hugging Face Spaces and via blog posts.
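The usage command above reads its input from a JSONL file, one JSON object per line. The sketch below shows one way to prepare such a file and assemble the same command line from Python. The record layout (a single "text" field with [S1]/[S2] speaker tags) and both helper names are assumptions for illustration; check examples/examples.jsonl in the repository for the actual schema. Only the flags shown in the usage line above are used.

```python
import json
import sys
from pathlib import Path


def write_jsonl(records, path):
    """Write one JSON object per line, keeping non-ASCII text readable."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return path


def build_command(jsonl_path, output_dir="outputs", seed=42):
    """Assemble the inference command from the flags in the usage line."""
    return [
        sys.executable, "inference.py",
        "--jsonl", str(jsonl_path),
        "--output_dir", str(output_dir),
        "--seed", str(seed),
        "--use_normalize",
    ]


# Hypothetical record: the "text" field name and the tag syntax are
# assumptions, not the project's documented schema.
EXAMPLE_RECORDS = [
    {"text": "[S1]你好，欢迎收听。[S2]Thanks, glad to be here."},
]
```

Calling write_jsonl(EXAMPLE_RECORDS, "my_dialogue.jsonl") and passing the result of build_command("my_dialogue.jsonl") to subprocess.run from the repository root would then launch inference. Building the command as a list avoids shell-quoting problems with paths that contain spaces.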

Highlighted Details

Generates highly expressive, human-like dialogue with natural conversational prosody.
Supports zero-shot two-speaker voice cloning with accurate speaker switching.
Enables Chinese-English bilingual speech generation.
Optimized for long-form speech generation.

Maintenance & Community

Recent releases include v0.5 (enhanced timbre switching, voice cloning, stability) and v0.
Provides a podcast generation pipeline (Podever).
Fine-tuning scripts and tools are available.

Licensing & Compatibility

Released under the Apache 2.0 license.
Supports free commercial use.

Limitations & Caveats

The model can be unstable: speaker-switching errors and timbre-cloning deviations may occur, and both are targeted for future optimization. Users are cautioned against misusing the model for unauthorized voice cloning, impersonation, or illegal activities.

Health Check

Last Commit

1 week ago

Responsiveness

Inactive

Star History

101 stars in the last 30 days