whisperX by m-bain

ASR tool for accurate, batched, word-level Whisper transcriptions

Created 2 years ago
17,777 stars

Top 2.6% on SourcePulse

Project Summary

WhisperX extends OpenAI's Whisper ASR with word-level timestamps and speaker diarization, producing time-stamped transcripts of long-form audio. It targets researchers and developers who need precise speech analysis, offering batched transcription at up to 70x realtime through a faster-whisper backend and word-level alignment with wav2vec2.

How It Works

WhisperX runs batched inference on a faster-whisper backend, which accounts for most of the speedup. Word-level timestamps come from forced alignment of the transcript against wav2vec2 models. Speaker diarization is provided by pyannote-audio, which labels each stretch of audio with a speaker. Voice Activity Detection (VAD) preprocessing removes non-speech audio before transcription, which reduces hallucinations and makes batching possible without degrading Word Error Rate (WER).
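
The batching step can be illustrated with a minimal sketch. It assumes VAD has already produced a list of speech spans and greedily packs consecutive spans into chunks no longer than Whisper's 30-second input window, so that chunks can be transcribed together as a batch. The `Span` type, the `merge_speech_spans` function, and the sample values are hypothetical and written for illustration; this is not WhisperX's actual code, and it omits the step that cuts a single span longer than the limit.

```python
from dataclasses import dataclass


@dataclass
class Span:
    start: float  # seconds
    end: float    # seconds


def merge_speech_spans(spans, max_chunk_s=30.0):
    """Greedily pack consecutive speech spans into chunks of at most max_chunk_s.

    A single span longer than max_chunk_s is kept whole (no cutting here).
    """
    chunks = []
    current = None
    for span in sorted(spans, key=lambda s: s.start):
        if current is None:
            current = Span(span.start, span.end)
        elif span.end - current.start <= max_chunk_s:
            # Adding this span keeps the chunk within the limit.
            current.end = max(current.end, span.end)
        else:
            # Chunk is full: close it and start a new one at this span.
            chunks.append(current)
            current = Span(span.start, span.end)
    if current is not None:
        chunks.append(current)
    return chunks


# Hypothetical VAD output: four speech spans with silence in between.
speech = [Span(0.0, 12.0), Span(13.5, 24.0), Span(26.0, 41.0), Span(80.0, 95.0)]
chunks = merge_speech_spans(speech)
for c in chunks:
    print(f"{c.start:6.1f} -> {c.end:6.1f}")
```

The long silence between 41 s and 80 s never reaches the model, which is where the reduction in hallucinations comes from.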

Quick Start & Requirements

  • Installation: pip install whisperx or uvx whisperx
  • Prerequisites: CUDA (for GPU acceleration), ffmpeg, Rust. A Hugging Face access token is required for speaker diarization.
  • Resources: Less than 8 GB of GPU memory for large-v2 with beam_size=5.
  • Docs: Setup, WhisperX
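
After installation, transcription and alignment can be driven from Python. The sketch below assumes the function names and arguments shown in the project's README (`whisperx.load_model`, `whisperx.load_audio`, `whisperx.load_align_model`, `whisperx.align`); these may differ between releases, so check the installed version. The wrapper `transcribe_with_word_timestamps` and the helper `iter_words` are hypothetical names introduced here, and `"audio.mp3"` is a placeholder path.

```python
def transcribe_with_word_timestamps(audio_path, device="cuda", batch_size=16,
                                    compute_type="float16"):
    """Transcribe with batched Whisper, then force-align for word timestamps."""
    import whisperx  # imported lazily so iter_words below works without it

    # 1. Batched transcription on the faster-whisper backend.
    model = whisperx.load_model("large-v2", device, compute_type=compute_type)
    audio = whisperx.load_audio(audio_path)
    result = model.transcribe(audio, batch_size=batch_size)

    # 2. Forced alignment with a wav2vec2 model for the detected language.
    align_model, metadata = whisperx.load_align_model(
        language_code=result["language"], device=device
    )
    return whisperx.align(
        result["segments"], align_model, metadata, audio, device,
        return_char_alignments=False,
    )


def iter_words(result):
    """Yield (word, start, end); start and end are None for unaligned words."""
    for segment in result["segments"]:
        for w in segment.get("words", []):
            yield w["word"], w.get("start"), w.get("end")


# Usage (needs whisperx, ffmpeg, and ideally a CUDA GPU):
#   result = transcribe_with_word_timestamps("audio.mp3")
#   for word, start, end in iter_words(result):
#       print(word, start, end)
```

If GPU memory is tight, lowering `batch_size` or switching `compute_type` to `"int8"` is the usual first adjustment.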

Highlighted Details

  • Achieves 70x realtime transcription speed with large-v2 and faster-whisper.
  • Integrates speaker diarization from pyannote-audio.
  • Provides accurate word-level timestamps via wav2vec2 alignment.
  • VAD preprocessing is enabled by default for improved accuracy and batching.
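
To show how word timestamps and diarization output can be combined, here is a minimal sketch that labels each word with the speaker turn it overlaps most in time. The dictionary layout, the `assign_speakers` function, and the sample data are assumptions made for illustration; WhisperX ships its own routine for this step, and this is not a copy of it.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(words, turns):
    """Label each word with the speaker whose turn overlaps it the most."""
    labeled = []
    for word in words:
        best_speaker, best_overlap = None, 0.0
        for turn in turns:
            ov = overlap(word["start"], word["end"], turn["start"], turn["end"])
            if ov > best_overlap:
                best_speaker, best_overlap = turn["speaker"], ov
        # Words that overlap no turn keep speaker=None.
        labeled.append({**word, "speaker": best_speaker})
    return labeled


# Hypothetical aligned words and diarization turns (times in seconds).
words = [
    {"word": "Hello", "start": 0.20, "end": 0.55},
    {"word": "there", "start": 0.60, "end": 0.95},
    {"word": "Hi", "start": 1.40, "end": 1.60},
]
turns = [
    {"speaker": "SPEAKER_00", "start": 0.0, "end": 1.0},
    {"speaker": "SPEAKER_01", "start": 0.9, "end": 2.0},
]
labeled = assign_speakers(words, turns)
for w in labeled:
    print(w["speaker"], w["word"])
```

Because each word gets exactly one label, a scheme like this cannot represent two people talking at once.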

Maintenance & Community

  • Paper accepted at INTERSPEECH 2023.
  • Contact: maxhbain@gmail.com
  • Supported by VGG and the University of Oxford.

Licensing & Compatibility

  • Primarily MIT license.
  • Compatible with commercial use.

Limitations & Caveats

Words containing characters outside the alignment model's dictionary (e.g., "2014.", "£13.60") cannot be aligned and so receive no timestamps. Overlapping speech is not handled well. pyannote/Speaker-Diarization-3.0 is known to run slowly because of dependency conflicts.
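
Downstream code therefore has to tolerate words with no timing. One possible workaround, sketched below, is to spread untimed words evenly across the gap between their timed neighbours. This is a hypothetical post-processing step, not part of WhisperX; the `fill_missing_times` function and the `interpolated` flag are names introduced here, and the word dictionaries are assumed to carry `start` and `end` keys only when alignment succeeded.

```python
def fill_missing_times(words):
    """Give untimed words a span interpolated between their timed neighbours.

    Untimed words at the very start or end of the list are left untouched,
    because there is no neighbour on one side to interpolate from.
    """
    filled = [dict(w) for w in words]  # copy so the input is not mutated
    i = 0
    while i < len(filled):
        if "start" in filled[i]:
            i += 1
            continue
        # Find the run of consecutive untimed words [i, j).
        j = i
        while j < len(filled) and "start" not in filled[j]:
            j += 1
        left = filled[i - 1]["end"] if i > 0 else None
        right = filled[j]["start"] if j < len(filled) else None
        if left is not None and right is not None:
            step = (right - left) / (j - i)
            for k in range(i, j):
                filled[k]["start"] = left + (k - i) * step
                filled[k]["end"] = left + (k - i + 1) * step
                filled[k]["interpolated"] = True  # mark as an estimate
        i = j
    return filled


# Hypothetical aligned output in which "2014." could not be aligned.
words = [
    {"word": "In", "start": 1.0, "end": 1.5},
    {"word": "2014."},
    {"word": "prices", "start": 2.0, "end": 2.5},
]
filled = fill_missing_times(words)
for w in filled:
    print(w["word"], w.get("start"), w.get("end"))
```

Keeping the `interpolated` flag lets later stages, such as subtitle rendering, treat estimated times differently from aligned ones.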

Health Check

  • Last commit: 2 months ago
  • Responsiveness: 1+ week
  • Pull requests (30d): 4
  • Issues (30d): 16
  • Star history: 438 stars in the last 30 days
