Autoregressive TTS model for streaming speech from any LLM
LLMVoX is a lightweight, LLM-agnostic autoregressive streaming Text-to-Speech (TTS) system that generates high-fidelity speech from LLM output with low latency. It targets developers and researchers building conversational AI, voice assistants, and multimodal applications, and integrates with any LLM or Vision-Language Model for real-time speech synthesis.
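Because the system consumes plain text, wiring it to an arbitrary LLM amounts to forwarding the model's streamed output. The sketch below shows one way to do that; `speak_llm_stream`, the `tts.enqueue` interface, and the punctuation-based flush heuristic are illustrative assumptions, not the project's actual API.

```python
# Hypothetical glue code: any LLM that yields text deltas can drive the TTS.
# `tts.enqueue` is an assumed interface, not LLMVoX's real API.
def speak_llm_stream(token_iter, tts):
    """Forward streamed LLM tokens to a TTS queue, flushing on punctuation
    so speech can start before the full response has been generated."""
    buffer = []
    for token in token_iter:
        buffer.append(token)
        if token.rstrip().endswith((".", "!", "?", ",")):
            tts.enqueue("".join(buffer))  # speak what we have so far
            buffer.clear()
    if buffer:  # flush any trailing words
        tts.enqueue("".join(buffer))
```

In practice the flush boundary would follow LLMVoX's own chunking policy rather than raw punctuation.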
How It Works
LLMVoX uses a multi-queue streaming architecture with two TTS model replicas to balance latency and speech quality. Incoming text is split into chunks: the first chunks are kept small so audio playback starts quickly, and later chunks grow in size for better audio quality. This enables continuous speech generation and infinite-length dialogues, and LLMVoX outperforms traditional speech-enabled LLMs on Word Error Rate while maintaining comparable latency and quality.
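A minimal sketch of this scheme follows, assuming a word-based chunker, two worker threads standing in for the replicas, and a placeholder `synthesize()` call; none of these names come from the LLMVoX codebase.

```python
import queue
import threading

def chunk_text(words, first_size=2, growth=2, max_size=16):
    """Yield chunks that start small (fast first audio) and grow
    (better prosody once playback is underway)."""
    size, i = first_size, 0
    while i < len(words):
        yield " ".join(words[i:i + size])
        i += size
        size = min(size * growth, max_size)

def synthesize(chunk, replica):
    """Stand-in for one autoregressive TTS decode."""
    return f"<audio:{replica}:{chunk}>"

def tts_worker(replica_id, in_q, out_q):
    """One TTS replica: pull text chunks, emit indexed audio so the
    player can keep chunks in order."""
    while True:
        idx, chunk = in_q.get()
        if chunk is None:  # sentinel: shut down
            break
        out_q.put((idx, synthesize(chunk, replica_id)))

def stream_speech(text):
    in_qs = [queue.Queue(), queue.Queue()]
    out_q = queue.Queue()
    workers = [
        threading.Thread(target=tts_worker, args=(i, in_qs[i], out_q))
        for i in range(2)
    ]
    for w in workers:
        w.start()
    chunks = list(chunk_text(text.split()))
    for idx, chunk in enumerate(chunks):
        in_qs[idx % 2].put((idx, chunk))  # alternate chunks between replicas
    for q in in_qs:
        q.put((None, None))
    audio = sorted(out_q.get() for _ in chunks)  # restore chunk order
    for w in workers:
        w.join()
    return [a for _, a in audio]

print(stream_speech("the first chunk is tiny so speech starts fast "
                    "and later chunks are longer for smoother audio"))
```

Alternating chunks between two replicas lets one replica synthesize while the other's output is playing, which is what keeps the stream gapless.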
Quick Start & Requirements
Update configs/inference_config.py with your checkpoint paths, then start the server with python streaming_server.py, passing arguments for the desired chat type (voice, text, visual speech, or multimodal). Launch the demo UI with python run_ui.py.
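For illustration, here is the kind of edit the config step expects; the field names below are assumptions, so consult the actual configs/inference_config.py in the repo for the real schema.

```python
# configs/inference_config.py -- field names here are hypothetical;
# the real file defines the authoritative schema.
llmvox_checkpoint = "CHECKPOINTS/llmvox.pt"                # streaming TTS weights
wavtokenizer_checkpoint = "CHECKPOINTS/wavtokenizer.ckpt"  # audio token decoder
device = "cuda:0"                                          # Ampere+ GPU, CUDA 11.7+
```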
Highlighted Details
Maintenance & Community
The project is from Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI). Key dependencies include NanoGPT, WavTokenizer, Whisper, and Neural G2P.
Licensing & Compatibility
LLMVoX is released under the CC-BY-NC-SA 4.0 license, which prohibits commercial use and requires derivative works to be shared under the same license.
Limitations & Caveats
Running LLMVoX requires specific hardware (an Ampere-or-newer NVIDIA GPU with CUDA 11.7+) and dependencies such as Flash Attention. The CC-BY-NC-SA 4.0 license may limit commercial adoption.