zai-org/GLM-TTS: Controllable, emotion-expressive zero-shot TTS
Top 42.2% on SourcePulse
Summary
GLM-TTS is a high-quality, controllable, and emotion-expressive zero-shot text-to-speech (TTS) system. It targets researchers and developers seeking advanced TTS capabilities, offering natural emotional speech synthesis and real-time streaming inference that significantly improve on the expressiveness of traditional TTS.
How It Works
The system uses a two-stage architecture: an LLM generates speech token sequences, which a Flow Matching model then converts into audio waveforms. A novel Multi-Reward Reinforcement Learning (RL) framework based on GRPO optimizes the LLM's generation for enhanced emotional expressiveness and prosody. Zero-shot voice cloning is achieved from a 3-10 second audio prompt without fine-tuning.
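The two-stage flow described above can be sketched with toy stand-ins. Everything here is an illustrative assumption, not GLM-TTS's actual API: the function names, the 8192-entry token codebook, the ~20 ms frame rate, and the sine-based "decoder" standing in for Flow Matching.

```python
import math

def llm_generate_speech_tokens(text: str, prompt_tokens: list[int]) -> list[int]:
    """Stage 1 (toy): an autoregressive LLM emits discrete speech tokens,
    conditioned on the input text and the prompt's tokens. Conditioning on
    a 3-10 s prompt is what enables zero-shot cloning without fine-tuning."""
    seed = sum(prompt_tokens)  # stand-in for speaker conditioning
    return [(seed + ord(c)) % 8192 for c in text]  # 8192 = assumed codebook size

def flow_matching_decode(tokens: list[int], sample_rate: int = 24000) -> list[float]:
    """Stage 2 (toy): a Flow Matching model maps token sequences to a waveform.
    Here each token just becomes a short sine segment."""
    samples_per_token = sample_rate // 50  # ~20 ms per token (assumed frame rate)
    wave: list[float] = []
    for tok in tokens:
        freq = 80.0 + (tok % 100)  # map token id to a pitch
        for n in range(samples_per_token):
            wave.append(math.sin(2 * math.pi * freq * n / sample_rate))
    return wave

def synthesize(text: str, prompt_tokens: list[int]) -> list[float]:
    """End-to-end pipeline: text + voice prompt -> tokens -> waveform."""
    tokens = llm_generate_speech_tokens(text, prompt_tokens)
    return flow_matching_decode(tokens)

audio = synthesize("Hello world", prompt_tokens=[12, 345, 678])
print(len(audio))  # one ~20 ms segment per generated token
```

Because the token stage is autoregressive and the decoder works segment by segment, this split is also what makes real-time streaming inference feasible: audio can be emitted as tokens arrive rather than after the full sequence is generated.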
Quick Start & Requirements
Install dependencies with pip install -r requirements.txt. Optional RL dependencies require cloning s3prl and omine-me/LaughterSegmentation into grpo/modules and downloading wavlm_large_finetune.pth.
Download model weights from Hugging Face (huggingface-cli download zai-org/GLM-TTS --local-dir ckpt) or ModelScope (modelscope download --model ZhipuAI/GLM-TTS --local_dir ckpt).
Run inference with python glmtts_inference.py, bash glmtts_inference.sh, or the interactive python -m tools.gradio_app.
Highlighted Details
Maintenance & Community
Officially open-sourced December 11, 2025. No community links (Discord, Slack) or roadmap are detailed in the README.
Licensing & Compatibility
The README states the project is "open-sourced" but omits a specific license type. Compatibility with commercial use or closed-source linking is not addressed. Example prompt audio is restricted to research use.
Limitations & Caveats
The project is newly open-sourced with "Coming Soon" features (e.g., 2D Vocos vocoder, RL-optimized weights). Example prompt audio is for research use only. No specific unsupported platforms or known bugs are detailed.