LLMVoX by mbzuai-oryx

Autoregressive TTS model for streaming speech from any LLM

created 4 months ago
267 stars

Top 96.7% on sourcepulse

View on GitHub
Project Summary

LLMVoX is a lightweight, LLM-agnostic autoregressive streaming Text-to-Speech (TTS) system designed for generating high-fidelity speech from LLM outputs with low latency. It targets developers and researchers building conversational AI, voice assistants, and multimodal applications, enabling seamless integration with any LLM or Vision-Language Model for real-time speech synthesis.

How It Works

LLMVoX employs a multi-queue streaming architecture with two TTS model replicas to achieve low latency and high-quality speech. It processes text in chunks: initial chunks are kept small for fast first-audio response, while subsequent chunks grow larger for better audio quality. This approach enables continuous speech generation and infinite-length dialogues, achieving a lower Word Error Rate than traditional speech-enabled LLMs while maintaining comparable latency and quality.
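The chunking strategy above can be sketched as a simple scheduler. This is an illustrative Python sketch, not the project's actual code: the function names, the doubling growth rule, and the chunk-size limits are all assumptions made for demonstration.

```python
# Illustrative sketch of LLMVoX-style chunk scheduling (assumed names
# and growth rule; the real implementation lives in the repository).

def chunk_sizes(total_words, first=2, growth=2, max_size=32):
    """Yield word-chunk sizes: small at first for low latency,
    then growing for better audio quality."""
    size = first
    consumed = 0
    while consumed < total_words:
        take = min(size, total_words - consumed)
        yield take
        consumed += take
        size = min(size * growth, max_size)

def schedule(words):
    """Split an LLM word stream into chunks and alternate them
    between two TTS replica queues so synthesis can overlap."""
    queues = ([], [])  # two TTS model replicas
    start = 0
    for i, n in enumerate(chunk_sizes(len(words))):
        queues[i % 2].append(" ".join(words[start:start + n]))
        start += n
    return queues

q0, q1 = schedule("the quick brown fox jumps over the lazy dog".split())
```

With this schedule, the first (small) chunk reaches a TTS replica almost immediately, and while it is being synthesized the second replica receives the next, larger chunk, which is how overlapping queues keep end-to-end latency low.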

Quick Start & Requirements

  • Installation: Clone the repository, create a conda environment with Python 3.9, install PyTorch with CUDA 11.8, Flash Attention 2.0+, and other dependencies.
  • Prerequisites: CUDA 11.7+, Flash Attention 2.0+ compatible GPU (Ampere or newer), Python 3.9.
  • Checkpoints: Download necessary model checkpoints from the Hugging Face repository (MBZUAI/LLMVoX).
  • Configuration: Update configs/inference_config.py with checkpoint paths.
  • Running: Execute python streaming_server.py with various arguments for different chat types (voice, text, visual speech, multimodal).
  • Demo UI: Launch a local demo UI with python run_ui.py.
  • Documentation: Refer to the README for detailed configuration and usage examples.

Highlighted Details

  • 30M parameter model size.
  • End-to-end latency as low as 300ms.
  • Supports voice chat, text chat, visual speech, and multimodal chat.
  • Integrates with ASR models like Whisper.
  • Customizable streaming chunk sizes and LLM parameters.

Maintenance & Community

The project is from Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI). Key dependencies include NanoGPT, WavTokenizer, Whisper, and Neural G2P.

Licensing & Compatibility

LLMVoX is released under the CC-BY-NC-SA 4.0 License, which prohibits commercial use and requires derivative works to be shared under the same license.

Limitations & Caveats

Requires specific hardware (Ampere+ GPU, CUDA 11.7+) and dependencies like Flash Attention. The CC-BY-NC-SA 4.0 license may limit commercial adoption.

Health Check

  • Last commit: 2 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 26 stars in the last 90 days
