whisper_streaming by ufal

Real-time streaming for long speech-to-text transcription/translation

Created 2 years ago
3,334 stars

Top 14.5% on SourcePulse

View on GitHub
Project Summary

This repository provides a real-time streaming speech-to-text and translation system built upon OpenAI's Whisper model. It addresses the challenge of Whisper's non-streaming nature for long-form audio, enabling applications like live transcription services and multilingual conference support. The target audience includes developers and researchers working with real-time audio processing and speech recognition.

How It Works

Whisper-Streaming achieves real-time performance with a "local agreement policy with self-adaptive latency". It processes the incoming audio in growing chunks and emits only the words on which consecutive Whisper updates agree, keeping the rest tentative until a later update confirms it. This lets latency adapt dynamically while maintaining transcription quality on unsegmented long-form speech.
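The confirmation step can be sketched in a few lines of Python. This is a toy illustration of the local-agreement idea (longest common word prefix of two consecutive hypotheses), not the project's actual implementation:

```python
def local_agreement(prev_hyp, curr_hyp):
    """Return the words confirmed by two consecutive hypotheses.

    Words that appear identically at the start of two successive
    Whisper updates are treated as confirmed and can be emitted
    immediately; everything after the first disagreement stays
    tentative until a later update agrees on it.
    """
    confirmed = []
    for prev_word, curr_word in zip(prev_hyp, curr_hyp):
        if prev_word != curr_word:
            break
        confirmed.append(prev_word)
    return confirmed

# Two consecutive (hypothetical) updates over a growing audio buffer:
update_1 = ["hello", "world", "this", "is"]
update_2 = ["hello", "world", "this", "was", "a", "test"]
print(local_agreement(update_1, update_2))  # → ['hello', 'world', 'this']
```

Because confirmation waits for at least two agreeing updates, latency grows when the model is unstable on a passage and shrinks when it is confident, which is the "self-adaptive" part of the policy.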

Quick Start & Requirements

  • Installation: pip install librosa soundfile
  • Whisper Backend: Requires installation of a backend like faster-whisper (recommended for GPU, requires CUDA >= 11.7), whisper-timestamped, openai-api (no GPU needed, but incurs costs), or mlx-whisper (for Apple Silicon).
  • Voice Activity Controller (Optional but Recommended): pip install torch torchaudio
  • Sentence Segmenter (Optional): Required for "sentence" buffer trimming; installation varies by language (e.g., opus-fast-mosestokenizer, tokenize_uk, wtpsplit).
  • Usage Example: python3 whisper_online.py audio_path --language en
  • Documentation: Code comments in whisper_online.py serve as full documentation.

Highlighted Details

  • Achieves 3.3 seconds of latency on long-form speech transcription.
  • Supports transcription and translation tasks.
  • Integrates Voice Activity Detection (VAD) and a Voice Activity Controller (VAC).
  • Offers multiple buffer trimming strategies ("segment" and "sentence").
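The two trimming strategies can be contrasted with a toy sketch (hypothetical function names, not the project's code): the rolling audio buffer is cut either at the end of the last confirmed Whisper segment or at the end of the last completed sentence, so the model never re-decodes audio that is already final.

```python
def trim_point(confirmed_segment_end, confirmed_sentence_end,
               strategy="segment"):
    """Pick the timestamp (in seconds) at which the audio buffer is cut.

    "segment"  — cut at Whisper's own last confirmed segment boundary
                 (works for any language, no extra dependencies).
    "sentence" — cut at the last completed sentence, which requires a
                 language-specific sentence segmenter to be installed.
    """
    if strategy == "sentence":
        return confirmed_sentence_end
    return confirmed_segment_end

# Sentence boundaries usually trail segment boundaries slightly,
# so "sentence" trimming keeps a little more context in the buffer:
print(trim_point(12.4, 10.8, strategy="sentence"))  # → 10.8
print(trim_point(12.4, 10.8))                       # → 12.4
```

This is why the "sentence" option carries the extra installation burden noted below: it needs a per-language tokenizer to know where sentences end.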

Maintenance & Community

  • Contributions are welcome.
  • Credits include Peter Polák for the original idea and the Silero Team for their VAD model.
  • Contact: Dominik Macháček, machacek@ufal.mff.cuni.cz.

Licensing & Compatibility

  • The repository itself does not explicitly state a license in the provided README. However, it relies on Whisper, which is typically released under the MIT license. Backend dependencies may have their own licenses.

Limitations & Caveats

  • The "sentence" buffer trimming option requires installing language-specific sentence segmenters, which can be complex and may not be available for all supported Whisper languages.
  • Using the OpenAI API backend incurs costs and requires careful monitoring.

Health Check

  • Last Commit: 2 weeks ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 2
  • Star History: 113 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems") and Travis Fischer (founder of Agentic).

RealtimeSTT by KoljaB

  • Top 0.5%, 9k stars
  • Speech-to-text library for realtime applications
  • Created 2 years ago; updated 2 months ago