Expressive Chinese-English spoken dialogue synthesis
Top 38.3% on SourcePulse
MOSS-TTSD is an open-source, bilingual (Chinese/English) text-to-speech model designed for expressive spoken dialogue generation. It enables zero-shot multi-speaker voice cloning and long-form speech synthesis, making it suitable for applications like AI podcast production.
How It Works
MOSS-TTSD leverages a unified semantic-acoustic neural audio codec, a pre-trained large language model, and extensive TTS data (millions of hours) including conversational speech. This architecture allows for highly expressive, human-like dialogue with natural prosody, supporting accurate speaker switching based on dialogue scripts and zero-shot voice cloning.
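To make "speaker switching based on dialogue scripts" concrete, the following is a minimal sketch of how a tagged script could be split into speaker turns before synthesis. The [S1]/[S2] tag format and the helper name split_turns are assumptions for illustration only; the description above does not specify the script format, and this is not the project's own preprocessing code.

```python
import re
from typing import List, Tuple

# Assumed tag format: "[S1]", "[S2]", ... marks the start of each speaker turn.
TAG = re.compile(r"\[(S\d+)\]")


def split_turns(script: str) -> List[Tuple[str, str]]:
    """Split a tagged dialogue script into (speaker, text) turns.

    Hypothetical helper: text that appears before the first tag is ignored,
    and turns that are empty after stripping whitespace are dropped.
    """
    # With one capture group, re.split yields
    # [prefix, tag1, text1, tag2, text2, ...].
    parts = TAG.split(script)
    turns = []
    for speaker, text in zip(parts[1::2], parts[2::2]):
        text = text.strip()
        if text:
            turns.append((speaker, text))
    return turns


if __name__ == "__main__":
    demo = "[S1]Welcome to the show.[S2]Thanks for having me.[S1]Let's begin."
    for speaker, text in split_turns(demo):
        print(f"{speaker}: {text}")
```

Each (speaker, text) pair marks a point where the synthesized voice would need to change, which is where the speaker switching errors noted under Limitations & Caveats would show up.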
Quick Start & Requirements
Use conda to create an environment (conda create -n moss_ttsd python=3.10 -y && conda activate moss_ttsd), then pip install -r requirements.txt and pip install flash-attn. Download the XY Tokenizer weights from its repository.
Run inference with python inference.py --jsonl examples/examples.jsonl --output_dir outputs --seed 42 --use_normalize.
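For scripted or batch runs, the inference command above can be assembled programmatically. This is a minimal sketch that uses only the flags shown in the quick-start command; the helper name build_inference_command is hypothetical, and treating --use_normalize as an on/off switch is an assumption based on how it appears in that command.

```python
import shlex
import sys
from typing import List


def build_inference_command(
    jsonl: str = "examples/examples.jsonl",
    output_dir: str = "outputs",
    seed: int = 42,
    use_normalize: bool = True,
) -> List[str]:
    """Build the argument list for inference.py (hypothetical helper)."""
    cmd = [
        sys.executable, "inference.py",
        "--jsonl", jsonl,
        "--output_dir", output_dir,
        "--seed", str(seed),
    ]
    # Assumption: --use_normalize is a boolean flag that takes no value.
    if use_normalize:
        cmd.append("--use_normalize")
    return cmd


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch works
    # outside a checkout of the repository.
    print(shlex.join(build_inference_command()))
```

To actually launch a run, the returned list could be passed to subprocess.run(cmd, check=True) from the repository root, inside the activated moss_ttsd environment.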
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The model may exhibit instability, including speaker switching errors and timbre cloning deviations, which are targeted for future optimization. Users are cautioned against misuse for unauthorized voice cloning, impersonation, or illegal activities.