whisper_streaming by ufal

Real-time streaming for long speech-to-text transcription/translation

created 2 years ago
3,161 stars

Top 15.6% on sourcepulse

Project Summary

This repository provides a real-time streaming speech-to-text and translation system built upon OpenAI's Whisper model. It addresses the challenge of Whisper's non-streaming nature for long-form audio, enabling applications like live transcription services and multilingual conference support. The target audience includes developers and researchers working with real-time audio processing and speech recognition.

How It Works

Whisper-Streaming achieves real-time performance with a "local agreement" policy and self-adaptive latency. It repeatedly re-transcribes a growing audio buffer and emits only the transcript prefix that agrees across consecutive updates, so confirmed words stream out while still-unstable ones are held back. Latency adjusts dynamically to the input, preserving quality and responsiveness on unsegmented long-form speech.
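The core of the policy can be sketched as a toy in a few lines of Python (a simplified stand-in; the real implementation in whisper_online.py additionally tracks word timestamps and trims the audio buffer):

```python
class ToyOnlineProcessor:
    """Minimal stand-in for a streaming ASR processor: re-decodes a growing
    buffer and emits only the words that are stable across consecutive
    hypotheses (the local agreement policy)."""

    def __init__(self):
        self.prev = []      # previous hypothesis, as a word list
        self.emitted = 0    # number of words already emitted

    def process(self, hypothesis):
        words = hypothesis.split()
        # Length of the common prefix of the previous and current hypothesis.
        stable = 0
        for a, b in zip(self.prev, words):
            if a != b:
                break
            stable += 1
        # Emit only newly confirmed words; hold back anything still unstable.
        new = words[self.emitted:stable]
        self.prev = words
        self.emitted = max(self.emitted, stable)
        return new

p = ToyOnlineProcessor()
out = []
for hyp in ["the cat", "the cat sat on", "the cat sat on the mat"]:
    out += p.process(hyp)
print(out)  # ['the', 'cat', 'sat', 'on'] — the tail awaits confirmation
```

Note that the last words of the final hypothesis are never emitted here: in the real system they are confirmed by a later update or flushed when the stream ends.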

Quick Start & Requirements

  • Installation: pip install librosa soundfile
  • Whisper Backend: Requires installation of a backend like faster-whisper (recommended for GPU, requires CUDA >= 11.7), whisper-timestamped, openai-api (no GPU needed, but incurs costs), or mlx-whisper (for Apple Silicon).
  • Voice Activity Controller (Optional but Recommended): pip install torch torchaudio
  • Sentence Segmenter (Optional): Required for "sentence" buffer trimming; installation varies by language (e.g., opus-fast-mosestokenizer, tokenize_uk, wtpsplit).
  • Usage Example: python3 whisper_online.py audio_path --language en
  • Documentation: Code comments in whisper_online.py serve as full documentation.

Highlighted Details

  • Achieves 3.3 seconds latency on long-form speech transcription.
  • Supports transcription and translation tasks.
  • Integrates Voice Activity Detection (VAD) and a Voice Activity Controller (VAC).
  • Offers multiple buffer trimming strategies ("segment" and "sentence").
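Buffer trimming keeps the re-transcribed window short: once a segment or sentence boundary is confirmed, audio up to that boundary is dropped. A minimal sketch of the idea (hypothetical helper, assuming 16 kHz mono samples; not the repository's actual function):

```python
def trim_buffer(audio, sample_rate, confirmed_end_s, offset_s=0.0):
    """Drop samples up to the last confirmed boundary; return the remaining
    audio and the new absolute offset of the buffer start in seconds."""
    cut = int((confirmed_end_s - offset_s) * sample_rate)
    return audio[cut:], confirmed_end_s

buf = [0.0] * (16000 * 3)          # 3 s of silence at 16 kHz
rest, offset = trim_buffer(buf, 16000, confirmed_end_s=2.0)
print(len(rest) / 16000, offset)   # 1.0 2.0 — one second of audio remains
```

The "segment" strategy picks the boundary from Whisper's own segment timestamps, while the "sentence" strategy relies on an external sentence segmenter, which is why the latter needs the language-specific packages listed above.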

Maintenance & Community

  • Contributions are welcome.
  • Credits include Peter Polák for the original idea and the Silero Team for their VAD model.
  • Contact: Dominik Macháček, machacek@ufal.mff.cuni.cz.

Licensing & Compatibility

  • The repository itself does not explicitly state a license in the provided README. However, it relies on Whisper, which is typically released under the MIT license. Backend dependencies may have their own licenses.

Limitations & Caveats

  • The "sentence" buffer trimming option requires installing language-specific sentence segmenters, which can be complex and may not be available for all supported Whisper languages.
  • Using the OpenAI API backend incurs costs and requires careful monitoring.

Health Check

  • Last commit: 1 month ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 2
  • Star History: 371 stars in the last 90 days
