willow-inference-server by toverainc

Local inference server for ASR/STT, TTS, and LLM tasks

created 2 years ago
465 stars

Top 66.2% on sourcepulse

Project Summary

This project provides a highly optimized, self-hosted inference server for language tasks, including Automatic Speech Recognition (ASR/STT), Text-to-Speech (TTS), and Large Language Models (LLMs). It targets users seeking cost-effective, real-time speech and language processing on local hardware, from low-end GPUs to high-end cards, with CPU-only support also available.

How It Works

Willow Inference Server (WIS) uses CTranslate2 for optimized Whisper ASR and AutoGPTQ for quantized LLM inference. It prioritizes low-latency, high-quality speech recognition over WebRTC, REST, and WebSocket transports. The server supports real-time audio streaming, custom TTS voice creation, and LLM integration with int4 quantization for memory savings, and it automatically detects available CUDA VRAM and compute capability to choose appropriate settings.
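
The VRAM-based auto-configuration can be sketched as follows. The thresholds and model names here are illustrative assumptions, not the project's actual values:

```python
# Illustrative sketch of VRAM-based model selection, similar in spirit to
# what WIS does at startup. Thresholds and model names are hypothetical.

def pick_whisper_model(vram_gb: float) -> str:
    """Map detected CUDA VRAM (in GB) to a Whisper model size (illustrative)."""
    if vram_gb >= 10:
        return "large-v2"   # headroom for simultaneous ASR + TTS + LLM
    if vram_gb >= 6:
        return "medium"
    if vram_gb >= 3:
        return "base"       # low-end cards such as a GTX 1060 3GB
    return "tiny"           # CPU-only or very constrained GPUs

print(pick_whisper_model(3))   # a 3 GB card gets "base"
```

The real server makes this decision automatically; the point of the sketch is that one install can scale from low-end to high-end hardware without manual tuning.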

Quick Start & Requirements

  • Install: Clone the repository, then run ./utils.sh install.
  • Prerequisites: NVIDIA drivers (version 530 recommended), nvidia-container-toolkit.
  • Run: Generate TLS certificates with ./utils.sh gen-cert [your hostname] and start the server with ./utils.sh run.
  • Docs: API documentation available at https://[your host]:19000/api/docs. WebRTC demo client at https://[your host]:19000/rtc.

Highlighted Details

  • Optimized for low-end GPUs (e.g., GTX 1060 3GB) with simultaneous ASR+TTS support in under 6GB VRAM.
  • Real-time ASR transcription with WebRTC, achieving sub-hundred-millisecond response times.
  • Supports LLM integration with int4 quantization for memory efficiency.
  • Benchmarks show significant "realtime multiple" gains, especially with longer speech segments.
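
To see why int4 quantization matters for fitting an LLM alongside ASR and TTS in limited VRAM, a rough back-of-envelope calculation (weights only, ignoring activations and KV cache; the 7B parameter count is just an example):

```python
# Approximate weight memory for a 7B-parameter model at different precisions.
# Weights only; real usage adds activations, KV cache, and framework overhead.
params = 7e9

fp16_gb = params * 2 / 1e9     # 2 bytes per parameter
int4_gb = params * 0.5 / 1e9   # 4 bits = 0.5 bytes per parameter

print(f"fp16: {fp16_gb:.1f} GB")  # 14.0 GB
print(f"int4: {int4_gb:.1f} GB")  # 3.5 GB
```

A roughly 4x reduction in weight memory is what makes LLM integration plausible on consumer cards.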

Maintenance & Community

The project is described as "very early and advancing rapidly," encouraging community contributions. Future plans include ready-to-deploy Docker containers for the 1.0 release.

Licensing & Compatibility

The repository does not explicitly state a license in the README.

Limitations & Caveats

The project is in early development, and rapid, potentially breaking changes should be expected. CPU optimization is a stated area for community contribution, as current CPU performance does not meet the project's latency targets.
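
The "realtime multiple" metric mentioned in the benchmarks is simply audio duration divided by processing time; values below 1.0, as on unoptimized CPU paths, mean transcription runs slower than the speech itself. A quick illustration with made-up timings (not measured WIS results):

```python
def realtime_multiple(audio_seconds: float, processing_seconds: float) -> float:
    """How many times faster than realtime the transcription ran."""
    return audio_seconds / processing_seconds

# Hypothetical numbers for illustration only:
print(realtime_multiple(30.0, 2.0))   # GPU path: 15x faster than realtime
print(realtime_multiple(30.0, 45.0))  # slow CPU path: ~0.67x, misses latency targets
```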

Health Check

  • Last commit: 1 month ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 27 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (author of AI Engineering and Designing Machine Learning Systems) and Travis Fischer (founder of Agentic).

RealtimeSTT by KoljaB

Speech-to-text library for realtime applications

Top 0.9% · 8k stars · created 1 year ago · updated 3 weeks ago