Local inference server for ASR/STT, TTS, and LLM tasks
This project provides a highly optimized, self-hosted inference server for speech and language tasks, including Automatic Speech Recognition (ASR/STT), Text-to-Speech (TTS), and Large Language Models (LLMs). It targets users seeking cost-effective, real-time speech and language processing on local hardware, from low-end GPUs to high-end cards, with CPU-only operation also supported.
How It Works
Willow Inference Server (WIS) leverages CTranslate2 for optimized Whisper ASR and AutoGPTQ for LLMs, enabling efficient inference. It prioritizes low-latency, high-quality speech recognition via WebRTC, REST, and WebSockets. The server supports real-time audio streaming, custom TTS voice creation, and LLM integration with int4 quantization for memory savings. It automatically detects and optimizes for available CUDA VRAM and compute capabilities.
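As a rough illustration of client usage, a speech-to-text request over REST might look like the sketch below. The /api/asr path, request body format, and response shape are assumptions for illustration only; the actual routes exposed by your server version are listed at https://[your host]:19000/api/docs.

```python
# Hypothetical REST ASR request to a local WIS instance.
# Endpoint path and payload format are assumptions; check /api/docs
# on your running server for the real routes and parameters.
import requests

WIS_URL = "https://localhost:19000/api/asr"  # hypothetical ASR route

with open("sample.wav", "rb") as f:
    resp = requests.post(
        WIS_URL,
        data=f.read(),
        headers={"Content-Type": "audio/wav"},
        verify=False,  # accept the self-signed cert from ./utils.sh gen-cert
        timeout=60,
    )
resp.raise_for_status()
print(resp.json())  # transcription result
```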
Quick Start & Requirements
- Install: ./utils.sh install
- GPU support inside the container requires the NVIDIA Container Toolkit (nvidia-container-toolkit).
- Generate a self-signed TLS certificate: ./utils.sh gen-cert [your hostname]
- Start the server: ./utils.sh run
- Interactive API documentation: https://[your host]:19000/api/docs (a quick connectivity check is sketched below)
- WebRTC demo client: https://[your host]:19000/rtc
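Once the server is running, a minimal reachability check against the documented API docs URL can confirm it is serving. The hostname and self-signed-certificate handling in this sketch are assumptions for a local setup.

```python
# Quick connectivity check against a running WIS instance.
# The /api/docs path comes from the quick-start steps above.
import requests

host = "localhost"  # replace with the hostname used with ./utils.sh gen-cert
resp = requests.get(f"https://{host}:19000/api/docs", verify=False, timeout=10)
print(resp.status_code)  # 200 indicates the API docs are being served
```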
Maintenance & Community
The project is described as "very early and advancing rapidly," encouraging community contributions. Future plans include ready-to-deploy Docker containers for the 1.0 release.
Licensing & Compatibility
The repository does not explicitly state a license in the README.
Limitations & Caveats
The project is in early development, with rapid changes expected. CPU optimization is a stated area for community contribution, as current CPU performance does not meet the project's latency targets.