MOSS-TTSD by OpenMOSS

Expressive Chinese-English spoken dialogue synthesis

Created 3 months ago
962 stars

Top 38.3% on SourcePulse

Project Summary

MOSS-TTSD is an open-source, bilingual (Chinese/English) text-to-speech model designed for expressive spoken dialogue generation. It enables zero-shot multi-speaker voice cloning and long-form speech synthesis, making it suitable for applications like AI podcast production.

How It Works

MOSS-TTSD combines a unified semantic-acoustic neural audio codec, a pre-trained large language model, and extensive TTS training data (millions of hours), including conversational speech. Together these produce expressive, human-like dialogue with natural prosody, with speaker switching driven by the dialogue script and zero-shot voice cloning.

Quick Start & Requirements

  • Installation: Create and activate a conda environment (conda create -n moss_ttsd python=3.10 -y && conda activate moss_ttsd), then run pip install -r requirements.txt and pip install flash-attn. Download the XY Tokenizer weights from the XY Tokenizer repository.
  • Prerequisites: Python 3.10, flash-attn.
  • Usage: Run local inference with python inference.py --jsonl examples/examples.jsonl --output_dir outputs --seed 42 --use_normalize.
  • Demos: Available on Hugging Face Spaces and via blog posts.
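
The usage command above can also be assembled from a script, which helps when batching several runs with different seeds or output directories. The following is a minimal sketch, not part of the project: the helper name build_inference_cmd is hypothetical, and it uses only the script name and flags quoted in the Usage bullet. It assumes it is run from the repository root with the environment already activated.

```python
import shlex


def build_inference_cmd(jsonl="examples/examples.jsonl",
                        output_dir="outputs",
                        seed=42,
                        use_normalize=True):
    """Return the inference command as an argument list.

    Hypothetical helper: it only mirrors the flags quoted in the
    Usage bullet (--jsonl, --output_dir, --seed, --use_normalize).
    """
    cmd = [
        "python", "inference.py",
        "--jsonl", jsonl,
        "--output_dir", output_dir,
        "--seed", str(seed),
    ]
    if use_normalize:
        # Boolean switch: present or absent, takes no value.
        cmd.append("--use_normalize")
    return cmd


if __name__ == "__main__":
    cmd = build_inference_cmd()
    # Print a copy-pasteable shell command.
    print(shlex.join(cmd))
    # To execute it instead (assumes the repo root as working directory):
    #   import subprocess; subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids shell-quoting problems if a path contains spaces.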

Highlighted Details

  • Generates highly expressive, human-like dialogue with natural conversational prosody.
  • Supports zero-shot two-speaker voice cloning with accurate speaker switching.
  • Enables Chinese-English bilingual speech generation.
  • Optimized for long-form speech generation.

Maintenance & Community

  • Recent releases include v0.5 (enhanced timbre switching, voice cloning, stability) and v0.
  • Provides a podcast generation pipeline (Podever).
  • Fine-tuning scripts and tools are available.

Licensing & Compatibility

  • Released under the Apache 2.0 license.
  • Supports free commercial use.

Limitations & Caveats

The model may exhibit instability, including speaker switching errors and timbre cloning deviations, which are targeted for future optimization. Users are cautioned against misuse for unauthorized voice cloning, impersonation, or illegal activities.

Health Check

  • Last commit: 3 days ago
  • Responsiveness: Inactive
  • Pull requests (30d): 6
  • Issues (30d): 18
  • Star history: 127 stars in the last 30 days
