zai-org/GLM-TTS: Controllable, emotion-expressive zero-shot TTS
Top 42.2% on SourcePulse
Summary
GLM-TTS is a high-quality, controllable, and emotion-expressive zero-shot text-to-speech (TTS) system. It targets researchers and developers seeking advanced TTS capabilities, offering natural emotional speech synthesis and real-time streaming inference that significantly improve on the expressiveness of traditional TTS.
How It Works
The system uses a two-stage architecture: an LLM generates speech token sequences, which a Flow Matching model then converts into audio waveforms. A novel Multi-Reward Reinforcement Learning (RL) framework based on GRPO optimizes the LLM's generation for enhanced emotional expressiveness and prosody. Zero-shot voice cloning is achieved from a 3-10 second audio prompt without fine-tuning.
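The two-stage flow described above can be sketched with toy stand-ins. Everything here is an illustrative assumption, not GLM-TTS's actual API: the function names, the 8192-entry token codebook, the ~20 ms frame rate, and the sine-based "decoder" standing in for Flow Matching.

```python
import math

def llm_generate_speech_tokens(text: str, prompt_tokens: list[int]) -> list[int]:
    """Stage 1 (toy): an autoregressive LLM emits discrete speech tokens,
    conditioned on the input text and the prompt's tokens. Conditioning on
    a 3-10 s prompt is what enables zero-shot cloning without fine-tuning."""
    seed = sum(prompt_tokens)  # stand-in for speaker conditioning
    return [(seed + ord(c)) % 8192 for c in text]  # 8192 = assumed codebook size

def flow_matching_decode(tokens: list[int], sample_rate: int = 24000) -> list[float]:
    """Stage 2 (toy): a Flow Matching model maps token sequences to a waveform.
    Here each token just becomes a short sine segment."""
    samples_per_token = sample_rate // 50  # ~20 ms per token (assumed frame rate)
    wave: list[float] = []
    for tok in tokens:
        freq = 80.0 + (tok % 100)  # map token id to a pitch
        for n in range(samples_per_token):
            wave.append(math.sin(2 * math.pi * freq * n / sample_rate))
    return wave

def synthesize(text: str, prompt_tokens: list[int]) -> list[float]:
    """End-to-end pipeline: text + voice prompt -> tokens -> waveform."""
    tokens = llm_generate_speech_tokens(text, prompt_tokens)
    return flow_matching_decode(tokens)

audio = synthesize("Hello world", prompt_tokens=[12, 345, 678])
print(len(audio))  # one ~20 ms segment per generated token
```

Because the token stage is autoregressive and the decoder works segment by segment, this split is also what makes real-time streaming inference feasible: audio can be emitted as tokens arrive rather than after the full sequence is generated.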
Quick Start & Requirements
Install dependencies with pip install -r requirements.txt. Optional RL dependencies require cloning s3prl and omine-me/LaughterSegmentation into grpo/modules and downloading wavlm_large_finetune.pth.
Download model weights from Hugging Face (huggingface-cli download zai-org/GLM-TTS --local-dir ckpt) or ModelScope (modelscope download --model ZhipuAI/GLM-TTS --local_dir ckpt).
Run inference with python glmtts_inference.py, bash glmtts_inference.sh, or the interactive python -m tools.gradio_app.
Highlighted Details
Maintenance & Community
Officially open-sourced December 11, 2025. No community links (Discord, Slack) or roadmap are detailed in the README.
Licensing & Compatibility
The README states the project is "open-sourced" but omits a specific license type. Compatibility with commercial use or closed-source linking is not addressed. Example prompt audio is restricted to research use.
Limitations & Caveats
The project is newly open-sourced with "Coming Soon" features (e.g., 2D Vocos vocoder, RL-optimized weights). Example prompt audio is for research use only. No specific unsupported platforms or known bugs are detailed.