Real-time audio transcription server using self-hosted Whisper
VoiceStreamAI provides a near-real-time audio transcription solution built on self-hosted Whisper and WebSocket streaming. It targets developers and users who need to transcribe streaming audio with customizable VAD and ASR components, and ships as a modular Python server with a JavaScript client. The primary benefit is efficient, accurate, and low-latency speech-to-text for a range of applications.
How It Works
The system uses WebSocket for real-time audio streaming between a JavaScript client and a Python server. The server applies a Hugging Face-hosted Voice Activity Detection (VAD) model (pyannote) to isolate speech segments, reducing computational load and improving accuracy. Transcription is handled by Faster Whisper (the default) or OpenAI's Whisper, with configurable chunking strategies to balance latency against completeness. The modular design, built on factory and strategy patterns, allows alternative VAD and ASR technologies to be plugged in, as sketched below.
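As an illustration of that plug-in design, the sketch below shows how a strategy interface plus a small factory could let ASR backends be swapped. The class and function names are hypothetical, not the repo's actual identifiers, though the faster-whisper calls are that library's real API:

```python
from abc import ABC, abstractmethod

class ASRInterface(ABC):
    """Strategy interface: any transcription backend sits behind one method."""
    @abstractmethod
    def transcribe(self, wav_path: str) -> str: ...

class FasterWhisperASR(ASRInterface):
    """Concrete strategy wrapping the faster-whisper library."""
    def __init__(self, model_size: str = "small", device: str = "auto"):
        from faster_whisper import WhisperModel  # pip install faster-whisper
        self.model = WhisperModel(model_size, device=device)

    def transcribe(self, wav_path: str) -> str:
        segments, _info = self.model.transcribe(wav_path)
        return " ".join(segment.text.strip() for segment in segments)

def asr_factory(name: str, **kwargs) -> ASRInterface:
    """Factory: resolve a config string to a concrete ASR strategy."""
    backends = {"faster_whisper": FasterWhisperASR}
    return backends[name](**kwargs)

# The server holds only the interface, so switching backends is a config change:
asr = asr_factory("faster_whisper", model_size="small")
print(asr.transcribe("sample.wav"))
```

A matching VAD interface would let the pyannote model be replaced the same way.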
Quick Start & Requirements
Install the Python dependencies (Python 3.8+):

```
pip install -r requirements.txt
```

Alternatively, build the Docker image and run it:

```
sudo docker build -t voicestreamai .
sudo docker run --gpus all -p 8765:8765 -e PYANNOTE_AUTH_TOKEN='VAD_TOKEN_HERE' voicestreamai
```

GPU acceleration under Docker requires the NVIDIA Container Toolkit. A Hugging Face auth token is required for the pyannote VAD model (https://huggingface.co/pyannote/segmentation). To use the bundled client, open client/index.html in a modern web browser.
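To smoke-test the server without the browser client, a short Python script can stream a WAV file over the WebSocket. This is a minimal sketch assuming the server accepts raw PCM audio bytes at ws://localhost:8765; the project's actual message framing (e.g., an initial JSON config message or expected sample rate) may differ:

```python
import asyncio
import wave

import websockets  # pip install websockets

async def print_replies(ws):
    # Print transcription messages as the server emits them.
    async for message in ws:
        print(message)

async def stream_wav(path: str, uri: str = "ws://localhost:8765"):
    async with websockets.connect(uri) as ws:
        receiver = asyncio.create_task(print_replies(ws))
        with wave.open(path, "rb") as wav:
            frames_per_send = wav.getframerate()  # ~1 second of audio
            while chunk := wav.readframes(frames_per_send):
                await ws.send(chunk)      # raw PCM bytes; real framing may differ
                await asyncio.sleep(1.0)  # pace the stream like a live source
        await asyncio.sleep(5.0)          # wait for trailing transcripts
        receiver.cancel()

asyncio.run(stream_wav("sample.wav"))
```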
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The processing strategy ("SilenceAtEndOfChunk") adds latency on dense speech, since transcription waits for a pause before flushing a chunk (see the sketch below). Whisper's accuracy can also degrade on small audio chunks because of lost context. Finally, the current implementation writes chunks to files before processing them.
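To make the latency trade-off concrete, here is a minimal sketch of a SilenceAtEndOfChunk-style flush rule, assuming the server buffers incoming audio and transcribes only when the buffer ends in a pause or exceeds a length cap (the function name and thresholds are illustrative, not the project's defaults):

```python
def should_flush(buffered_seconds: float,
                 trailing_silence: float,
                 min_silence: float = 0.3,
                 max_chunk: float = 10.0) -> bool:
    """Prefer to cut at a pause, but cap how long transcription can be deferred.

    Dense speech keeps trailing_silence below min_silence, so the buffer grows
    toward max_chunk before transcription runs -- the extra latency noted above.
    """
    return trailing_silence >= min_silence or buffered_seconds >= max_chunk
```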