VoiceStreamAI by alesaccoia

Real-time audio transcription server using self-hosted Whisper

created 1 year ago
898 stars

Top 41.2% on sourcepulse

View on GitHub
Project Summary

VoiceStreamAI provides near-realtime audio transcription using self-hosted Whisper over WebSocket. It targets developers who need to transcribe streaming audio with customizable VAD and ASR components, and ships as a modular Python server with a JavaScript client. The primary benefit is efficient, accurate, low-latency speech-to-text for a range of applications.

How It Works

The system uses WebSocket for real-time audio streaming between a JavaScript client and a Python server. The server runs the pyannote Voice Activity Detection (VAD) model from Hugging Face to isolate speech segments, reducing computational load and improving accuracy. Transcription is handled by Faster Whisper (the default) or OpenAI's Whisper, with configurable chunking strategies to balance latency against completeness. A modular design based on factory and strategy patterns allows alternative VAD and ASR backends to be swapped in.
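
For intuition, that layering might look like the minimal sketch below; the class and function names are hypothetical, not the project's actual API:

```python
# Illustrative sketch of the factory/strategy layering described above.
# All names here (VADStrategy, ASRStrategy, asr_factory) are hypothetical,
# not the repository's actual API.
from abc import ABC, abstractmethod
from typing import List, Tuple


class VADStrategy(ABC):
    """Pluggable voice-activity-detection backend."""

    @abstractmethod
    def speech_segments(self, audio: bytes) -> List[Tuple[float, float]]:
        """Return (start, end) times, in seconds, of detected speech."""


class ASRStrategy(ABC):
    """Pluggable speech-to-text backend."""

    @abstractmethod
    def transcribe(self, audio: bytes) -> str:
        """Return the transcript for one speech segment."""


class FasterWhisperASR(ASRStrategy):
    def transcribe(self, audio: bytes) -> str:
        raise NotImplementedError("a real backend would call faster-whisper here")


def asr_factory(name: str) -> ASRStrategy:
    """Resolve a backend name (e.g. from a config file) to a concrete strategy."""
    backends = {"faster_whisper": FasterWhisperASR}
    return backends[name]()
```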

Quick Start & Requirements

  • Server Installation: pip install -r requirements.txt (Python 3.8+)
  • Docker: Build with sudo docker build -t voicestreamai . and run with sudo docker run --gpus all -p 8765:8765 -e PYANNOTE_AUTH_TOKEN='VAD_TOKEN_HERE' voicestreamai. Requires NVIDIA Container Toolkit for GPU acceleration.
  • VAD Token: Obtain from Hugging Face (https://huggingface.co/pyannote/segmentation).
  • Client: Open client/index.html in a modern web browser (a minimal Python test client is also sketched after this list).
  • Docs: Demo Video
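
For a quick end-to-end smoke test without opening the browser client, a small Python sender can stream audio to the server. This is a sketch under assumptions: the port matches the Docker command above, but the exact message framing (raw PCM frames, any initial config message, the reply format) may differ from the real protocol in client/index.html.

```python
# Hypothetical smoke-test client. The port matches the Docker command above;
# the framing (binary PCM frames in, one text reply out) is an assumption.
# Requires: pip install websockets
import asyncio

import websockets


async def stream_file(path: str, chunk_ms: int = 250,
                      sample_rate: int = 16000, sample_width: int = 2) -> None:
    bytes_per_chunk = sample_rate * sample_width * chunk_ms // 1000
    async with websockets.connect("ws://localhost:8765") as ws:
        with open(path, "rb") as f:
            while True:
                chunk = f.read(bytes_per_chunk)
                if not chunk:
                    break
                await ws.send(chunk)                  # one binary frame per chunk
                await asyncio.sleep(chunk_ms / 1000)  # pace it like a live microphone
        print(await ws.recv())                        # wait for a transcription message


asyncio.run(stream_file("speech_16khz_mono.raw"))
```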

Highlighted Details

  • Near-realtime transcription via WebSocket.
  • Modular design supporting custom VAD/ASR components.
  • Default use of Faster Whisper for speed (see the standalone sketch after this list).
  • Configurable audio chunking and silence handling.
  • Supports multilingual transcription.
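
As context for the Faster Whisper default, this is what a standalone transcription call with the faster-whisper package looks like; it is independent example code, not the server's internal usage:

```python
# Standalone faster-whisper example (independent of the server's internals).
from faster_whisper import WhisperModel

# "small" model on GPU; use device="cpu", compute_type="int8" without CUDA.
model = WhisperModel("small", device="cuda", compute_type="float16")

# language=None auto-detects the language, which enables multilingual use.
segments, info = model.transcribe("chunk.wav", language=None)
print(f"Detected language: {info.language} (p={info.language_probability:.2f})")
for seg in segments:
    print(f"[{seg.start:.2f}s -> {seg.end:.2f}s] {seg.text}")
```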

Maintenance & Community

  • Project maintained by Alessandro Saccoia.
  • Open for contributions via pull requests.

Licensing & Compatibility

  • No license specified in the README; suitability for commercial use is therefore undetermined.

Limitations & Caveats

The "SilenceAtEndOfChunk" processing strategy adds latency on dense speech, since transcription waits for a pause before flushing the buffered audio. Whisper's accuracy can also degrade on small audio chunks because of lost context. Finally, the current implementation writes each chunk to a file before processing it.
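
To make the latency trade-off concrete, here is a hypothetical sketch of a flush-on-trailing-silence policy in the spirit of "SilenceAtEndOfChunk"; the class shape and threshold are assumptions, not the repository's code:

```python
# Hypothetical reconstruction of a flush-on-trailing-silence policy; the class
# name mirrors the strategy above, but fields and thresholds are assumptions.
from typing import List, Optional, Tuple


class SilenceAtEndOfChunk:
    def __init__(self, min_trailing_silence_s: float = 0.5):
        self.min_trailing_silence_s = min_trailing_silence_s
        self.buffer = bytearray()

    def add_chunk(self, chunk: bytes, vad_segments: List[Tuple[float, float]],
                  chunk_duration_s: float) -> Optional[bytes]:
        """vad_segments are (start, end) speech times, in seconds, within this chunk."""
        self.buffer.extend(chunk)
        last_speech_end = vad_segments[-1][1] if vad_segments else 0.0
        if chunk_duration_s - last_speech_end >= self.min_trailing_silence_s:
            audio = bytes(self.buffer)  # flush: the speaker paused long enough
            self.buffer.clear()
            return audio                # caller hands this to the ASR backend
        return None  # speech runs to the chunk edge: keep buffering (extra latency)
```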

Health Check

  • Last commit: 10 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 40 stars in the last 90 days
