Unmute lets text-based Large Language Models (LLMs) listen and speak, enabling real-time voice conversations. It is aimed at users and developers who want to add speech capabilities to LLM applications, and it offers a low-latency, flexible system.
How It Works
Unmute uses a three-stage pipeline: a Speech-to-Text (STT) model transcribes the user's speech, an LLM processes the resulting text, and a Text-to-Speech (TTS) model converts the LLM's text response to speech. The architecture prioritizes low latency by optimizing the STT and TTS components, and it can be integrated with various LLM backends such as vLLM or external APIs.
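The STT, LLM, TTS flow described above can be sketched as three pluggable stages. This is a minimal illustration, not Unmute's actual code: the `VoicePipeline` class, the stage signatures, and the `fake_*` stand-ins are all hypothetical, and the real services stream audio and text incrementally rather than handling one complete message at a time.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stage signatures, chosen for illustration only.
Transcriber = Callable[[bytes], str]   # STT: audio in, text out
Generator = Callable[[str], str]       # LLM: text in, text out
Synthesizer = Callable[[str], bytes]   # TTS: text in, audio out


@dataclass
class VoicePipeline:
    """Chains STT -> LLM -> TTS; any stage can be swapped independently."""

    stt: Transcriber
    llm: Generator
    tts: Synthesizer

    def respond(self, audio_in: bytes) -> bytes:
        text_in = self.stt(audio_in)    # 1. transcribe the user's speech
        text_out = self.llm(text_in)    # 2. let the LLM produce a text reply
        return self.tts(text_out)       # 3. turn the reply back into audio


# Stand-in stages so the sketch runs without any models (assumptions, not real APIs).
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(prompt: str) -> str:
    return f"You said: {prompt}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


pipeline = VoicePipeline(stt=fake_stt, llm=fake_llm, tts=fake_tts)
reply = pipeline.respond(b"hello")
```

Because the LLM stage only maps text to text, it is the natural place to plug in a different backend without touching the speech stages.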
Quick Start & Requirements
- Installation: Recommended via Docker Compose (docker compose up --build).
- Hardware: GPU with CUDA support and at least 16 GB memory.
- OS: Linux or Windows with WSL. macOS is not supported.
- Dependencies: NVIDIA Container Toolkit for Docker. Hugging Face Hub token for LLM access.
- Setup: Docker Compose setup is described as "Very easy."
- Documentation: Unmute.sh
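Before starting the containers, it can help to confirm that the machine meets the GPU memory requirement listed above. The following is a rough pre-flight sketch, not part of Unmute: the function names are hypothetical, and the 15,000 MiB threshold is an assumption that leaves some slack, since a card sold as 16 GB may report slightly less total memory. It relies on the nvidia-smi query option for memory.total, which prints one value in MiB per GPU.

```python
import shutil
import subprocess

# Assumption: accept slightly less than a nominal 16 GB; adjust as needed.
MIN_GPU_MEMORY_MIB = 15_000


def parse_gpu_memory_mib(nvidia_smi_output: str) -> list[int]:
    """Parse one integer (MiB) per non-empty line of nvidia-smi output."""
    return [int(line.strip()) for line in nvidia_smi_output.splitlines() if line.strip()]


def query_gpu_memory_mib() -> list[int]:
    """Return total memory per GPU in MiB, or an empty list if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return []
    return parse_gpu_memory_mib(result.stdout)


def has_enough_gpu_memory(memory_mib: list[int], minimum: int = MIN_GPU_MEMORY_MIB) -> bool:
    """True if at least one GPU reports `minimum` MiB or more."""
    return any(mib >= minimum for mib in memory_mib)
```

A typical use would be `has_enough_gpu_memory(query_gpu_memory_mib())`, which returns False on machines without an NVIDIA driver.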
Highlighted Details
- Achieves ~450 ms TTS latency on a multi-GPU setup, down from ~750 ms on a single GPU.
- Supports running STT, TTS, and LLM on separate GPUs for performance gains.
- The frontend is a Next.js app; the backend communicates with it via a protocol based on the OpenAI Realtime API.
- Includes a load testing client for measuring latency and throughput.
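To make the latency figures above concrete, here is a small sketch of how per-request latency can be timed and summarized. It is not Unmute's load testing client: `measure_latencies_ms` and `summarize` are hypothetical helpers, and `task` stands in for whatever request is being measured.

```python
import statistics
import time
from typing import Callable


def measure_latencies_ms(task: Callable[[], object], runs: int) -> list[float]:
    """Run `task` sequentially `runs` times and record each wall-clock duration in ms."""
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        task()
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies


def summarize(latencies_ms: list[float]) -> dict[str, float]:
    """Median, nearest-rank 95th percentile, and maximum of the samples."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)
    # Nearest-rank p95 using integer arithmetic: rank = ceil(0.95 * n).
    p95_rank = (95 * len(ordered) + 99) // 100
    return {
        "median_ms": statistics.median(ordered),
        "p95_ms": ordered[p95_rank - 1],
        "max_ms": ordered[-1],
    }
```

Reporting a high percentile alongside the median matters for voice applications, because occasional slow responses are more noticeable in conversation than a slightly higher average.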
Maintenance & Community
- The project actively encourages users to report issues when troubleshooting.
- Development pointers are provided for modifying voices, prompts, and swapping frontends.
- Contributions for features like tool calling are welcomed.
Licensing & Compatibility
- No explicit license is mentioned in the README.
- Compatibility for commercial use or closed-source linking is not specified.
Limitations & Caveats
- Native macOS support is not provided.
- HTTPS support is omitted from default Docker Compose and Dockerless setups.
- Docker Swarm deployment is documented because it is used internally, but help with debugging it is not offered.