willow-inference-server by toverainc

Local inference server for ASR/STT, TTS, and LLM tasks

Created 3 years ago
499 stars

Top 62.1% on SourcePulse

Project Summary

This project provides a highly optimized, self-hosted inference server for language tasks, including Automatic Speech Recognition (ASR/STT), Text-to-Speech (TTS), and Large Language Models (LLMs). It targets users seeking cost-effective, real-time speech and language processing on local hardware, from low-end GPUs to high-end cards, with CPU-only support also available.

How It Works

Willow Inference Server (WIS) leverages CTranslate2 for optimized Whisper ASR and AutoGPTQ for LLMs, enabling efficient inference. It prioritizes low-latency, high-quality speech recognition via WebRTC, REST, and WebSockets. The server supports real-time audio streaming, custom TTS voice creation, and LLM integration with int4 quantization for memory savings. It automatically detects and optimizes for available CUDA VRAM and compute capabilities.

Quick Start & Requirements

  • Install: Clone the repository, then run ./utils.sh install.
  • Prerequisites: NVIDIA drivers (version 530 recommended), nvidia-container-toolkit.
  • Run: Generate TLS certificates with ./utils.sh gen-cert [your hostname] and start the server with ./utils.sh run.
  • Docs: API documentation available at https://[your host]:19000/api/docs. WebRTC demo client at https://[your host]:19000/rtc.
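Condensed, the quick-start steps above look like this (the clone URL is inferred from the project name and owner shown on this page; substitute your own hostname):

```shell
# Clone the repository and install dependencies.
# NVIDIA drivers (530 recommended) and nvidia-container-toolkit
# must already be installed.
git clone https://github.com/toverainc/willow-inference-server.git
cd willow-inference-server
./utils.sh install

# Generate a self-signed TLS certificate for your hostname, then start.
./utils.sh gen-cert your.hostname.example
./utils.sh run
```

Once running, the API docs and WebRTC demo client are served on port 19000 as noted above.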

Highlighted Details

  • Optimized for low-end GPUs (e.g., GTX 1060 3GB) with simultaneous ASR+TTS support in under 6GB VRAM.
  • Real-time ASR transcription with WebRTC, achieving sub-hundred-millisecond response times.
  • Supports LLM integration with int4 quantization for memory efficiency.
  • Benchmarks show significant "realtime multiple" gains, especially with longer speech segments.
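To see why int4 quantization saves memory, here is a back-of-the-envelope sketch. The 7B parameter count is a hypothetical example, not a model WIS ships, and activation/KV-cache overhead is ignored:

```python
def model_weight_gib(n_params: float, bits_per_param: float) -> float:
    """Approximate weight-only memory footprint in GiB."""
    return n_params * bits_per_param / 8 / 2**30

N = 7e9  # hypothetical 7B-parameter LLM

print(f"fp16 weights: {model_weight_gib(N, 16):.1f} GiB")  # ≈ 13.0 GiB
print(f"int4 weights: {model_weight_gib(N, 4):.1f} GiB")   # ≈ 3.3 GiB
```

Weights-only, int4 is a 4x reduction over fp16, which is what makes LLM inference feasible alongside ASR and TTS on cards with modest VRAM.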

Maintenance & Community

The project is described as "very early and advancing rapidly," encouraging community contributions. Future plans include ready-to-deploy Docker containers for the 1.0 release.

Licensing & Compatibility

The repository does not explicitly state a license in the README.

Limitations & Caveats

The project is in early development, with rapid changes expected. CPU optimization is a stated area for community contribution, as current CPU performance does not meet the project's latency targets.

Health Check

  • Last commit: 1 month ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 4 stars in the last 30 days
