VoiceStreamAI by alesaccoia

Real-time audio transcription server using self-hosted Whisper

created 1 year ago
898 stars

Top 41.2% on sourcepulse

View on GitHub
Project Summary

VoiceStreamAI provides near-realtime audio transcription using self-hosted Whisper over WebSocket. It targets developers who need to transcribe streaming audio with customizable VAD and ASR components, and ships as a modular Python server with a JavaScript client. The primary benefit is efficient, accurate, low-latency speech-to-text for a range of applications.

How It Works

The system uses WebSocket for real-time audio streaming between a JavaScript client and a Python server. The server runs the pyannote Voice Activity Detection (VAD) model from Hugging Face to isolate speech segments, reducing computational load and improving accuracy. Transcription is handled by Faster Whisper (the default) or OpenAI's Whisper, with configurable chunking strategies to balance latency against completeness. A modular design based on factory and strategy patterns allows alternative VAD and ASR backends to be swapped in.
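
For intuition, that layering might look like the minimal sketch below; the class and function names are hypothetical, not the project's actual API:

```python
# Illustrative sketch of the factory/strategy layering described above.
# All names here (VADStrategy, ASRStrategy, asr_factory) are hypothetical,
# not the repository's actual API.
from abc import ABC, abstractmethod
from typing import List, Tuple


class VADStrategy(ABC):
    """Pluggable voice-activity-detection backend."""

    @abstractmethod
    def speech_segments(self, audio: bytes) -> List[Tuple[float, float]]:
        """Return (start, end) times, in seconds, of detected speech."""


class ASRStrategy(ABC):
    """Pluggable speech-to-text backend."""

    @abstractmethod
    def transcribe(self, audio: bytes) -> str:
        """Return the transcript for one speech segment."""


class FasterWhisperASR(ASRStrategy):
    def transcribe(self, audio: bytes) -> str:
        raise NotImplementedError("a real backend would call faster-whisper here")


def asr_factory(name: str) -> ASRStrategy:
    """Resolve a backend name (e.g. from a config file) to a concrete strategy."""
    backends = {"faster_whisper": FasterWhisperASR}
    return backends[name]()
```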

Quick Start & Requirements

  • Server Installation: pip install -r requirements.txt (Python 3.8+)
  • Docker: Build with sudo docker build -t voicestreamai . and run with sudo docker run --gpus all -p 8765:8765 -e PYANNOTE_AUTH_TOKEN='VAD_TOKEN_HERE' voicestreamai. Requires NVIDIA Container Toolkit for GPU acceleration.
  • VAD Token: Obtain from Hugging Face (https://huggingface.co/pyannote/segmentation).
  • Client: Open client/index.html in a modern web browser (a minimal Python test client is also sketched after this list).
  • Docs: Demo Video
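
For a quick end-to-end smoke test without opening the browser client, a small Python sender can stream audio to the server. This is a sketch under assumptions: the port matches the Docker command above, but the exact message framing (raw PCM frames, any initial config message, the reply format) may differ from the real protocol in client/index.html.

```python
# Hypothetical smoke-test client. The port matches the Docker command above;
# the framing (binary PCM frames in, one text reply out) is an assumption.
# Requires: pip install websockets
import asyncio

import websockets


async def stream_file(path: str, chunk_ms: int = 250,
                      sample_rate: int = 16000, sample_width: int = 2) -> None:
    bytes_per_chunk = sample_rate * sample_width * chunk_ms // 1000
    async with websockets.connect("ws://localhost:8765") as ws:
        with open(path, "rb") as f:
            while True:
                chunk = f.read(bytes_per_chunk)
                if not chunk:
                    break
                await ws.send(chunk)                  # one binary frame per chunk
                await asyncio.sleep(chunk_ms / 1000)  # pace it like a live microphone
        print(await ws.recv())                        # wait for a transcription message


asyncio.run(stream_file("speech_16khz_mono.raw"))
```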

Highlighted Details

  • Near-realtime transcription via WebSocket.
  • Modular design supporting custom VAD/ASR components.
  • Default use of Faster Whisper for speed (see the standalone sketch after this list).
  • Configurable audio chunking and silence handling.
  • Supports multilingual transcription.
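
As context for the Faster Whisper default, this is what a standalone transcription call with the faster-whisper package looks like; it is independent example code, not the server's internal usage:

```python
# Standalone faster-whisper example (independent of the server's internals).
from faster_whisper import WhisperModel

# "small" model on GPU; use device="cpu", compute_type="int8" without CUDA.
model = WhisperModel("small", device="cuda", compute_type="float16")

# language=None auto-detects the language, which enables multilingual use.
segments, info = model.transcribe("chunk.wav", language=None)
print(f"Detected language: {info.language} (p={info.language_probability:.2f})")
for seg in segments:
    print(f"[{seg.start:.2f}s -> {seg.end:.2f}s] {seg.text}")
```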

Maintenance & Community

  • Project maintained by Alessandro Saccoia.
  • Open for contributions via pull requests.

Licensing & Compatibility

  • No license specified in the README; suitability for commercial use is therefore undetermined.

Limitations & Caveats

The "SilenceAtEndOfChunk" processing strategy adds latency on dense speech, since transcription waits for a pause before flushing the buffered audio. Whisper's accuracy can also degrade on small audio chunks because of lost context. Finally, the current implementation writes each chunk to a file before processing it.
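
To make the latency trade-off concrete, here is a hypothetical sketch of a flush-on-trailing-silence policy in the spirit of "SilenceAtEndOfChunk"; the class shape and threshold are assumptions, not the repository's code:

```python
# Hypothetical reconstruction of a flush-on-trailing-silence policy; the class
# name mirrors the strategy above, but fields and thresholds are assumptions.
from typing import List, Optional, Tuple


class SilenceAtEndOfChunk:
    def __init__(self, min_trailing_silence_s: float = 0.5):
        self.min_trailing_silence_s = min_trailing_silence_s
        self.buffer = bytearray()

    def add_chunk(self, chunk: bytes, vad_segments: List[Tuple[float, float]],
                  chunk_duration_s: float) -> Optional[bytes]:
        """vad_segments are (start, end) speech times, in seconds, within this chunk."""
        self.buffer.extend(chunk)
        last_speech_end = vad_segments[-1][1] if vad_segments else 0.0
        if chunk_duration_s - last_speech_end >= self.min_trailing_silence_s:
            audio = bytes(self.buffer)  # flush: the speaker paused long enough
            self.buffer.clear()
            return audio                # caller hands this to the ASR backend
        return None  # speech runs to the chunk edge: keep buffering (extra latency)
```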

Health Check

  • Last commit: 10 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 40 stars in the last 90 days
