Real-time streaming for long speech-to-text transcription/translation
This repository provides a real-time streaming speech-to-text and translation system built upon OpenAI's Whisper model. It addresses the challenge of Whisper's non-streaming nature for long-form audio, enabling applications like live transcription services and multilingual conference support. The target audience includes developers and researchers working with real-time audio processing and speech recognition.
How It Works
Whisper-Streaming employs a "local agreement policy with self-adaptive latency" to achieve real-time performance. It processes audio in chunks, emitting confirmed transcriptions based on agreement across consecutive updates. This approach allows for dynamic latency adjustment, ensuring high quality and responsiveness even with unsegmented long-form speech.
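The confirmation step can be sketched as taking the longest common prefix of two consecutive Whisper hypotheses over the growing audio buffer: tokens that both updates agree on are emitted as confirmed. This is an illustrative sketch, not the repository's exact implementation; the function name `confirmed_prefix` is ours.

```python
def confirmed_prefix(prev_hyp, new_hyp):
    """Return the longest common prefix of two token lists.

    Tokens on which two consecutive transcription updates agree are
    treated as 'confirmed' and can be emitted to the user immediately;
    the unstable tail is held back until a later update confirms it.
    """
    out = []
    for a, b in zip(prev_hyp, new_hyp):
        if a != b:
            break
        out.append(a)
    return out

# Two consecutive hypotheses over a growing audio buffer:
prev = ["the", "quick", "brown", "fox"]
new = ["the", "quick", "brown", "fax", "jumps"]
confirmed_prefix(prev, new)  # → ["the", "quick", "brown"]
```

Because only the agreed-upon prefix is emitted, latency adapts by itself: stable speech is confirmed quickly, while ambiguous stretches wait for another update.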
Quick Start & Requirements
Install the audio libraries:

pip install librosa soundfile

Install one Whisper backend: faster-whisper (recommended for GPU, requires CUDA >= 11.7), whisper-timestamped, openai-api (no GPU needed, but incurs API costs), or mlx-whisper (for Apple Silicon).

Optional: pip install torch torchaudio (needed for voice activity detection).

Optional: a sentence segmenter for your language (e.g. opus-fast-mosestokenizer, tokenize_uk, wtpsplit).

Run the demo on an audio file:

python3 whisper_online.py audio_path --language en
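Beyond the CLI, the repository's README documents a Python streaming loop built around an OnlineASRProcessor with insert_audio_chunk / process_iter / finish methods. The sketch below shows that loop shape using a hypothetical stand-in processor (EchoProcessor is ours), so it runs without a GPU or the repository installed:

```python
class EchoProcessor:
    """Hypothetical stand-in mirroring OnlineASRProcessor's interface
    (insert_audio_chunk / process_iter / finish); the real processor runs
    Whisper on the growing buffer and applies the local agreement policy."""

    def __init__(self):
        self.buf = []

    def insert_audio_chunk(self, chunk):
        self.buf.append(chunk)

    def process_iter(self):
        # Real version: return newly confirmed text, possibly empty.
        confirmed, self.buf = self.buf, []
        return " ".join(confirmed)

    def finish(self):
        # Real version: flush whatever remains unconfirmed at end of audio.
        return " ".join(self.buf)


online = EchoProcessor()
transcript = []
for chunk in ["hello", "world"]:  # stands in for live audio chunks
    online.insert_audio_chunk(chunk)
    out = online.process_iter()
    if out:
        transcript.append(out)
tail = online.finish()
if tail:
    transcript.append(tail)
print(transcript)  # ['hello', 'world']
```

With the real package, the processor is constructed from a backend, e.g. OnlineASRProcessor(FasterWhisperASR(...)) as shown in the repository's README, and the loop feeds 16 kHz audio chunks rather than strings.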
The --help output and the comments in whisper_online.py serve as full documentation.

Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats