whisperX by m-bain

ASR tool for accurate, batched, word-level Whisper transcriptions

created 2 years ago
17,056 stars

Top 2.7% on sourcepulse

Project Summary

WhisperX extends OpenAI's Whisper ASR with word-level timestamps and speaker diarization, producing accurately time-stamped transcriptions of long-form audio. It targets researchers and developers who need precise speech analysis. With large-v2 on a faster-whisper backend it reaches up to 70x realtime transcription speed, and word timings come from forced alignment with wav2vec2.

How It Works

WhisperX runs batched inference on a faster-whisper backend, which accounts for most of its transcription speedup. Word-level timestamps come from forced alignment of the transcript against the audio with wav2vec2 models. Speaker diarization is handled by pyannote-audio, which labels who speaks when in multi-speaker audio. Voice Activity Detection (VAD) preprocessing reduces hallucinations and makes batching more efficient without degrading Word Error Rate (WER).
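To make the last step of that pipeline concrete, here is a minimal sketch of how word timestamps from alignment and speaker turns from diarization could be combined. It is not WhisperX's own code: the function names, the dict fields, and the "largest overlap wins" rule are assumptions chosen for illustration.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(words, turns):
    """Label each word with the speaker whose turn overlaps it the most.

    words: list of dicts with "word", "start", "end" (seconds) -- assumed shape
    turns: list of (start, end, speaker) tuples from a diarization step
    Words that overlap no turn get speaker None.
    """
    labeled = []
    for w in words:
        best_speaker, best_overlap = None, 0.0
        for t_start, t_end, speaker in turns:
            ov = overlap(w["start"], w["end"], t_start, t_end)
            if ov > best_overlap:
                best_speaker, best_overlap = speaker, ov
        labeled.append({**w, "speaker": best_speaker})
    return labeled


# Hypothetical data: three aligned words and two speaker turns.
words = [
    {"word": "hello", "start": 0.10, "end": 0.45},
    {"word": "there", "start": 0.50, "end": 0.90},
    {"word": "hi", "start": 1.20, "end": 1.40},
]
turns = [(0.0, 1.0, "SPEAKER_00"), (1.0, 2.0, "SPEAKER_01")]
labeled = assign_speakers(words, turns)
```

A rule like this has no good answer when two people talk at once, since a word can then overlap two turns almost equally.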

Quick Start & Requirements

  • Installation: pip install whisperx or uvx whisperx
  • Prerequisites: CUDA (for GPU acceleration), ffmpeg, Rust. A Hugging Face access token is required for speaker diarization.
  • Resources: Less than 8 GB of GPU memory for large-v2 with beam_size=5.
  • Docs: Setup, WhisperX
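Once installed, the library can also be driven from Python. The sketch below follows the usage pattern shown in the project's README (load_model, load_audio, transcribe, load_align_model, align); exact signatures may differ between releases, so treat it as a starting point rather than a reference. The wrapper function name and its default values are assumptions, and diarization is left out.

```python
def transcribe_with_word_timestamps(audio_path, device="cuda", batch_size=16):
    """Transcribe one audio file and return segments with word-level timing.

    Hypothetical convenience wrapper; only the whisperx calls inside it
    come from the project's documented usage.
    """
    # Imported lazily so this function can be defined without whisperx installed.
    import whisperx

    # 1. Batched transcription with the faster-whisper backend.
    model = whisperx.load_model("large-v2", device, compute_type="float16")
    audio = whisperx.load_audio(audio_path)
    result = model.transcribe(audio, batch_size=batch_size)

    # 2. Forced alignment with a wav2vec2 model for the detected language.
    align_model, metadata = whisperx.load_align_model(
        language_code=result["language"], device=device
    )
    aligned = whisperx.align(
        result["segments"], align_model, metadata, audio, device
    )
    return aligned["segments"]
```

It would be called as transcribe_with_word_timestamps("audio.wav"), where the file name is a placeholder. If GPU memory is tight, a smaller batch_size is the first thing to try.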

Highlighted Details

  • Achieves 70x realtime transcription speed with large-v2 and faster-whisper.
  • Integrates speaker diarization from pyannote-audio.
  • Provides accurate word-level timestamps via wav2vec2 alignment.
  • VAD preprocessing is enabled by default for improved accuracy and batching.
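The link between VAD and batching can be illustrated with a small sketch: speech regions found by VAD are merged into chunks of bounded length, so that silence is skipped and the chunks can be transcribed together in a batch. This is an assumption-laden illustration, not WhisperX's implementation; the function name, the greedy rule, and the 30-second limit are chosen for the example.

```python
def merge_speech_segments(segments, max_chunk=30.0):
    """Greedily merge (start, end) speech segments into chunks.

    A segment joins the current chunk if the chunk, measured from its start
    to the segment's end, stays within max_chunk seconds. A single segment
    longer than max_chunk is kept as its own chunk and is not split here.
    """
    chunks = []
    for start, end in sorted(segments):
        if chunks and end - chunks[-1][0] <= max_chunk:
            # Extend the current chunk to cover this segment.
            chunks[-1] = (chunks[-1][0], max(chunks[-1][1], end))
        else:
            # Start a new chunk.
            chunks.append((start, end))
    return chunks


# Hypothetical VAD output in seconds; the silence between 55 s and 90 s
# is never sent to the model.
speech = [(0.0, 12.0), (13.5, 24.0), (26.0, 41.0), (43.0, 55.0), (90.0, 95.0)]
chunks = merge_speech_segments(speech)
# chunks == [(0.0, 24.0), (26.0, 55.0), (90.0, 95.0)]
```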

Maintenance & Community

  • Paper accepted at INTERSPEECH 2023.
  • Contact: maxhbain@gmail.com
  • Supported by VGG and the University of Oxford.

Licensing & Compatibility

  • Primarily MIT license.
  • Compatible with commercial use.

Limitations & Caveats

Words containing characters outside the alignment model's vocabulary (e.g., "2014.", "£13.60") may not receive timestamps. Overlapping speech is not handled well. A known issue causes slow performance with pyannote/speaker-diarization-3.0 due to dependency conflicts.
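Downstream code therefore should not assume every word carries a timestamp. The sketch below shows one possible workaround: an untimed word borrows an approximate span from its timed neighbours. It assumes untimed words simply lack the "start" and "end" fields; the function name and the "approximate" flag are hypothetical and not part of WhisperX.

```python
def fill_missing_timestamps(words):
    """Return a copy of words where untimed entries get an approximate span.

    An untimed word spans from the end of the previous timed word to the
    start of the next timed word. Consecutive untimed words share the same
    gap. If no timed neighbour exists at all, the word is left untimed.
    """
    filled = []
    for i, w in enumerate(words):
        if "start" in w and "end" in w:
            filled.append(dict(w))
            continue
        # Look outwards in the original list for the nearest timed neighbours.
        prev_end = next(
            (p["end"] for p in reversed(words[:i]) if "end" in p), None
        )
        next_start = next(
            (n["start"] for n in words[i + 1:] if "start" in n), None
        )
        start = prev_end if prev_end is not None else next_start
        end = next_start if next_start is not None else prev_end
        if start is None:
            filled.append(dict(w))  # nothing to borrow from
        else:
            filled.append({**w, "start": start, "end": end, "approximate": True})
    return filled


# Hypothetical aligned output where "2014." received no timing.
words = [
    {"word": "In", "start": 0.20, "end": 0.35},
    {"word": "2014."},
    {"word": "it", "start": 0.95, "end": 1.05},
]
filled = fill_missing_timestamps(words)
# filled[1] == {"word": "2014.", "start": 0.35, "end": 0.95, "approximate": True}
```

Marking borrowed spans as approximate keeps them distinguishable from timings the alignment model actually produced.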

Health Check

  • Last commit: 1 month ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 8
  • Issues (30d): 10

Star History

  • 1,866 stars in the last 90 days
