GLM-TTS by zai-org

Controllable, emotion-expressive zero-shot TTS

Created 1 month ago
847 stars

Top 42.2% on SourcePulse

View on GitHub
Project Summary

GLM-TTS is a high-quality, controllable, emotion-expressive zero-shot text-to-speech (TTS) system. It targets researchers and developers who need advanced TTS capabilities, offering natural emotional speech synthesis, real-time streaming inference, and expressiveness well beyond traditional TTS systems.

How It Works

The system uses a two-stage architecture: an LLM first generates a sequence of discrete speech tokens, and a Flow Matching model then converts those tokens into an audio waveform. A Multi-Reward Reinforcement Learning (RL) framework based on GRPO optimizes the LLM's generation for enhanced emotional expressiveness and prosody. Zero-shot voice cloning works from a 3-10 second audio prompt, with no fine-tuning required.
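The two-stage flow can be sketched in pure Python. All names below are hypothetical stand-ins for illustration only, not the actual GLM-TTS API: stage 1 emits discrete speech tokens, stage 2 iteratively refines them into a waveform.

```python
# Illustrative sketch of the two-stage pipeline (hypothetical stubs,
# NOT the real GLM-TTS API).

def llm_generate_speech_tokens(text, prompt_tokens):
    """Stage 1 (stub): an LLM autoregressively emits discrete speech
    tokens, conditioned on input text and a 3-10 s voice prompt."""
    # Dummy tokens; a real model would sample from the LLM's vocabulary.
    return [hash((text, i)) % 1024 for i in range(len(text))]

def flow_matching_decode(speech_tokens, steps=10):
    """Stage 2 (stub): a Flow Matching model iteratively refines an
    initial state into an audio waveform conditioned on the tokens."""
    waveform = [0.0] * len(speech_tokens)
    for _ in range(steps):
        # Each step moves the state toward the token-conditioned target.
        waveform = [0.5 * w + 0.5 * (t / 1024.0)
                    for w, t in zip(waveform, speech_tokens)]
    return waveform

prompt_tokens = llm_generate_speech_tokens("reference clip", [])
tokens = llm_generate_speech_tokens("Hello, world!", prompt_tokens)
audio = flow_matching_decode(tokens)
print(len(audio))  # one dummy sample per token
```

The stubs only mirror the data flow (text + prompt → tokens → waveform); the real models are neural networks with learned parameters.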

Quick Start & Requirements

  • Environment: Python 3.10-3.12.
  • Installation: Clone the repo, then pip install -r requirements.txt. Optional RL dependencies require cloning s3prl and omine-me/LaughterSegmentation into grpo/modules and downloading wavlm_large_finetune.pth.
  • Models: Download from HuggingFace (huggingface-cli download zai-org/GLM-TTS --local-dir ckpt) or ModelScope (modelscope download --model ZhipuAI/GLM-TTS --local_dir ckpt).
  • Inference: python glmtts_inference.py, bash glmtts_inference.sh, or interactive python -m tools.gradio_app.
  • Links: HuggingFace, ModelScope.
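The steps above can be assembled into one script. The repository URL is an assumption inferred from the org and project name; the remaining commands are those documented above.

```shell
# Consolidated quick start. Assumes Python 3.10-3.12 and huggingface-cli
# are installed; the clone URL is inferred, not confirmed by the README.
git clone https://github.com/zai-org/GLM-TTS.git
cd GLM-TTS
pip install -r requirements.txt

# Download model weights (ModelScope is the alternative mirror):
huggingface-cli download zai-org/GLM-TTS --local-dir ckpt

# Run inference via the script, the shell wrapper, or the Gradio UI:
python glmtts_inference.py
# bash glmtts_inference.sh
# python -m tools.gradio_app
```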

Highlighted Details

  • RL-enhanced Emotion Control: Multi-reward RL (GRPO) optimizes for sound quality, similarity, emotion, and laughter, reducing CER from 1.03 to 0.89.
  • Zero-shot Voice Cloning: Clones voices from 3-10s prompt audio.
  • Streaming Inference: Supports real-time audio generation.
  • Phoneme-level Control: Fine-grained pronunciation control via "Hybrid Phoneme + Text" input.

Maintenance & Community

Officially open-sourced December 11, 2025. The README does not list community links (Discord, Slack) or a roadmap.

Licensing & Compatibility

The README describes the project as "open-sourced" but does not name a specific license. Compatibility with commercial use or closed-source linking is not addressed. The example prompt audio is restricted to research use.

Limitations & Caveats

The project is newly open-sourced, with several features still marked "Coming Soon" (e.g., the 2D Vocos vocoder and RL-optimized weights). The example prompt audio is for research use only. No unsupported platforms or known bugs are listed.

Health Check

  • Last Commit: 3 weeks ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 2
  • Issues (30d): 27
  • Star History: 336 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Michael Han (Cofounder of Unsloth), and 1 more.

Orpheus-TTS by canopyai — Top 0.2% on SourcePulse, 6k stars
Open-source TTS for human-sounding speech, built on Llama-3b
Created 10 months ago, updated 1 month ago
Starred by Georgios Konstantopoulos (CTO, General Partner at Paradigm) and Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems").

GPT-SoVITS by RVC-Boss — Top 0.4% on SourcePulse, 54k stars
Few-shot voice cloning and TTS web UI
Created 2 years ago, updated 1 week ago