Local inference server for ASR/STT, TTS, and LLM tasks
This project provides a highly optimized, self-hosted inference server for speech and language tasks, including Automatic Speech Recognition (ASR/STT), Text-to-Speech (TTS), and Large Language Models (LLMs). It targets users seeking cost-effective, real-time speech and language processing on local hardware, from low-end GPUs to high-end cards, with CPU-only operation also supported.
How It Works
Willow Inference Server (WIS) leverages CTranslate2 for optimized Whisper ASR and AutoGPTQ for LLMs, enabling efficient inference. It prioritizes low-latency, high-quality speech recognition via WebRTC, REST, and WebSockets. The server supports real-time audio streaming, custom TTS voice creation, and LLM integration with int4 quantization for memory savings. It automatically detects and optimizes for available CUDA VRAM and compute capabilities.
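As a rough illustration of client usage, a speech-to-text request over REST might look like the sketch below. The /api/asr path, request body format, and response shape are assumptions for illustration only; the actual routes exposed by your server version are listed at https://[your host]:19000/api/docs.

```python
# Hypothetical REST ASR request to a local WIS instance.
# Endpoint path and payload format are assumptions; check /api/docs
# on your running server for the real routes and parameters.
import requests

WIS_URL = "https://localhost:19000/api/asr"  # hypothetical ASR route

with open("sample.wav", "rb") as f:
    resp = requests.post(
        WIS_URL,
        data=f.read(),
        headers={"Content-Type": "audio/wav"},
        verify=False,  # accept the self-signed cert from ./utils.sh gen-cert
        timeout=60,
    )
resp.raise_for_status()
print(resp.json())  # transcription result
```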
Quick Start & Requirements
- Install: ./utils.sh install
- GPU support inside the container requires the NVIDIA Container Toolkit (nvidia-container-toolkit).
- Generate a self-signed TLS certificate: ./utils.sh gen-cert [your hostname]
- Start the server: ./utils.sh run
- Interactive API documentation: https://[your host]:19000/api/docs (a quick connectivity check is sketched below)
- WebRTC demo client: https://[your host]:19000/rtc
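Once the server is running, a minimal reachability check against the documented API docs URL can confirm it is serving. The hostname and self-signed-certificate handling in this sketch are assumptions for a local setup.

```python
# Quick connectivity check against a running WIS instance.
# The /api/docs path comes from the quick-start steps above.
import requests

host = "localhost"  # replace with the hostname used with ./utils.sh gen-cert
resp = requests.get(f"https://{host}:19000/api/docs", verify=False, timeout=10)
print(resp.status_code)  # 200 indicates the API docs are being served
```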
Maintenance & Community
The project is described as "very early and advancing rapidly," encouraging community contributions. Future plans include ready-to-deploy Docker containers for the 1.0 release.
Licensing & Compatibility
The repository does not explicitly state a license in the README.
Limitations & Caveats
The project is in early development, with rapid changes expected. CPU optimization is a stated area for community contribution, as current CPU performance does not meet the project's latency targets.