Autoregressive TTS model for streaming speech from any LLM
LLMVoX is a lightweight, LLM-agnostic autoregressive streaming Text-to-Speech (TTS) system that generates high-fidelity speech from LLM output with low latency. It targets developers and researchers building conversational AI, voice assistants, and multimodal applications, and integrates with any LLM or Vision-Language Model for real-time speech synthesis.
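Because the system consumes plain text, wiring it to an arbitrary LLM amounts to forwarding the model's streamed output. The sketch below shows one way to do that; `speak_llm_stream`, the `tts.enqueue` interface, and the punctuation-based flush heuristic are illustrative assumptions, not the project's actual API.

```python
# Hypothetical glue code: any LLM that yields text deltas can drive the TTS.
# `tts.enqueue` is an assumed interface, not LLMVoX's real API.
def speak_llm_stream(token_iter, tts):
    """Forward streamed LLM tokens to a TTS queue, flushing on punctuation
    so speech can start before the full response has been generated."""
    buffer = []
    for token in token_iter:
        buffer.append(token)
        if token.rstrip().endswith((".", "!", "?", ",")):
            tts.enqueue("".join(buffer))  # speak what we have so far
            buffer.clear()
    if buffer:  # flush any trailing words
        tts.enqueue("".join(buffer))
```

In practice the flush boundary would follow LLMVoX's own chunking policy rather than raw punctuation.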
How It Works
LLMVoX uses a multi-queue streaming architecture with two TTS model replicas to balance latency and speech quality. Incoming text is split into chunks: the first chunks are kept small so audio playback starts quickly, and later chunks grow in size for better audio quality. This enables continuous speech generation and infinite-length dialogues, and LLMVoX outperforms traditional speech-enabled LLMs on Word Error Rate while maintaining comparable latency and quality.
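A minimal sketch of this scheme follows, assuming a word-based chunker, two worker threads standing in for the replicas, and a placeholder `synthesize()` call; none of these names come from the LLMVoX codebase.

```python
import queue
import threading

def chunk_text(words, first_size=2, growth=2, max_size=16):
    """Yield chunks that start small (fast first audio) and grow
    (better prosody once playback is underway)."""
    size, i = first_size, 0
    while i < len(words):
        yield " ".join(words[i:i + size])
        i += size
        size = min(size * growth, max_size)

def synthesize(chunk, replica):
    """Stand-in for one autoregressive TTS decode."""
    return f"<audio:{replica}:{chunk}>"

def tts_worker(replica_id, in_q, out_q):
    """One TTS replica: pull text chunks, emit indexed audio so the
    player can keep chunks in order."""
    while True:
        idx, chunk = in_q.get()
        if chunk is None:  # sentinel: shut down
            break
        out_q.put((idx, synthesize(chunk, replica_id)))

def stream_speech(text):
    in_qs = [queue.Queue(), queue.Queue()]
    out_q = queue.Queue()
    workers = [
        threading.Thread(target=tts_worker, args=(i, in_qs[i], out_q))
        for i in range(2)
    ]
    for w in workers:
        w.start()
    chunks = list(chunk_text(text.split()))
    for idx, chunk in enumerate(chunks):
        in_qs[idx % 2].put((idx, chunk))  # alternate chunks between replicas
    for q in in_qs:
        q.put((None, None))
    audio = sorted(out_q.get() for _ in chunks)  # restore chunk order
    for w in workers:
        w.join()
    return [a for _, a in audio]

print(stream_speech("the first chunk is tiny so speech starts fast "
                    "and later chunks are longer for smoother audio"))
```

Alternating chunks between two replicas lets one replica synthesize while the other's output is playing, which is what keeps the stream gapless.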
Quick Start & Requirements
Update configs/inference_config.py with your checkpoint paths, then start the server with python streaming_server.py, passing arguments for the desired chat type (voice, text, visual speech, or multimodal). Launch the demo UI with python run_ui.py.
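For illustration, here is the kind of edit the config step expects; the field names below are assumptions, so consult the actual configs/inference_config.py in the repo for the real schema.

```python
# configs/inference_config.py -- field names here are hypothetical;
# the real file defines the authoritative schema.
llmvox_checkpoint = "CHECKPOINTS/llmvox.pt"                # streaming TTS weights
wavtokenizer_checkpoint = "CHECKPOINTS/wavtokenizer.ckpt"  # audio token decoder
device = "cuda:0"                                          # Ampere+ GPU, CUDA 11.7+
```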
Highlighted Details
Maintenance & Community
The project is from Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI). Key dependencies include NanoGPT, WavTokenizer, Whisper, and Neural G2P.
Licensing & Compatibility
LLMVoX is released under the CC-BY-NC-SA 4.0 license, which prohibits commercial use and requires derivative works to be shared under the same license.
Limitations & Caveats
Running LLMVoX requires specific hardware (an Ampere-or-newer NVIDIA GPU with CUDA 11.7+) and dependencies such as Flash Attention. The CC-BY-NC-SA 4.0 license may limit commercial adoption.