m-bain/whisperX: ASR tool for accurate, batched, word-level Whisper transcriptions
Top 2.4% on SourcePulse
WhisperX enhances OpenAI's Whisper ASR by providing word-level timestamps and speaker diarization, enabling highly accurate, time-stamped transcriptions for long-form audio. It targets researchers and developers needing precise speech analysis, offering transcription at up to 70x realtime via a faster-whisper backend and accurate alignment with wav2vec2.
How It Works
WhisperX leverages a faster-whisper backend for batched inference, significantly boosting transcription speed. Word-level timestamps are achieved through forced alignment with wav2vec2 models. Speaker diarization is integrated using pyannote-audio, allowing for multi-speaker identification. Voice Activity Detection (VAD) preprocessing is employed to reduce hallucinations and improve batching efficiency without degrading Word Error Rate (WER).
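The forced-alignment idea behind the word-level timestamps can be sketched with a toy dynamic program: given per-frame log-probabilities for each token and the known token sequence, find the monotonic assignment of contiguous frame spans to tokens that maximizes the total score, then convert frame indices to times. This is a simplified illustration of the principle only, not WhisperX's actual wav2vec2/CTC alignment code; the frame duration and token inventory are invented for the example.

```python
# Toy forced alignment: assign contiguous frame spans to a known token
# sequence so the summed log-probability is maximal (monotonic DP).
# Illustrative only -- real wav2vec2 alignment uses CTC over characters.

def force_align(log_probs, tokens, frame_sec=0.02):
    """log_probs[n][t]: log-prob of token t at frame n.
    Returns [(token, start_sec, end_sec)] spans in order."""
    N, K = len(log_probs), len(tokens)
    NEG = float("-inf")
    # dp[k][n]: best score aligning tokens[:k+1] to frames[:n+1]
    dp = [[NEG] * N for _ in range(K)]
    back = [[0] * N for _ in range(K)]  # 0 = stay on token, 1 = advance
    dp[0][0] = log_probs[0][tokens[0]]
    for n in range(1, N):
        dp[0][n] = dp[0][n - 1] + log_probs[n][tokens[0]]
    for k in range(1, K):
        for n in range(k, N):  # token k needs at least k earlier frames
            stay, advance = dp[k][n - 1], dp[k - 1][n - 1]
            if advance > stay:
                dp[k][n] = advance + log_probs[n][tokens[k]]
                back[k][n] = 1
            else:
                dp[k][n] = stay + log_probs[n][tokens[k]]
    # Backtrack from the last frame to recover each token's frame span.
    spans, k, n, end = [], K - 1, N - 1, N - 1
    while k > 0:
        if back[k][n] == 1:  # frame n is the first frame of token k
            spans.append((tokens[k], n * frame_sec, (end + 1) * frame_sec))
            k, end = k - 1, n - 1
        n -= 1
    spans.append((tokens[0], 0.0, (end + 1) * frame_sec))
    return spans[::-1]
```

With two tokens whose log-probabilities peak in the first and second halves of a six-frame clip, the DP recovers the obvious split at the 0.06 s boundary.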
Quick Start & Requirements
Install with pip install whisperx or uvx whisperx. Requires ffmpeg and rust. A Hugging Face access token is required for speaker diarization. To reproduce the paper's results, use the large-v2 model with beam_size=5.
Highlighted Details
Batched inference with large-v2 and faster-whisper. Speaker diarization via pyannote-audio. Word-level timestamps via wav2vec2 alignment.
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
Words with non-alphanumeric characters (e.g., "2014.", "£13.60") may not receive timestamps due to alignment model limitations. Overlapping speech is not optimally handled. A known issue exists with slow performance for pyannote/Speaker-Diarization-3.0 due to dependency conflicts.
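A common workaround for words that receive no timestamps (not part of WhisperX itself) is to fill the gap from the nearest aligned neighbors. The helper below is hypothetical; the field names ("word", "start", "end") mirror the shape of WhisperX-style aligned output, and the clip-bound parameters are assumptions for the sketch.

```python
# Fill missing word timestamps from the nearest neighbors that did get
# aligned. Hypothetical helper; field names ("word", "start", "end")
# mirror WhisperX-style aligned output.

def interpolate_timestamps(words, clip_start=0.0, clip_end=None):
    words = [dict(w) for w in words]  # don't mutate the caller's list
    for i, w in enumerate(words):
        if "start" in w and "end" in w:
            continue
        # Nearest aligned neighbor on each side (else the clip bounds).
        prev_end = next((words[j]["end"] for j in range(i - 1, -1, -1)
                         if "end" in words[j]), clip_start)
        nxt_start = next((words[j]["start"] for j in range(i + 1, len(words))
                          if "start" in words[j]), clip_end)
        if nxt_start is None:
            nxt_start = prev_end  # no later anchor: zero-length span
        w["start"], w["end"] = prev_end, nxt_start
    return words
```

For example, an unaligned "£13.60" between two aligned words is assigned the span between its neighbors' end and start times.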