whisperX by m-bain

ASR tool for accurate, batched, word-level Whisper transcriptions

Created 3 years ago

19,543 stars

Top 2.3% on SourcePulse

View on GitHub

11 Experts Love This Project

Georgios Konstantopoulos

CTO, General Partner at Paradigm

Paul Gauthier

Founder of Aider

Nicolae Rusan

Cofounder of Magnet, Clay

Kyle Mathews

Author of Gatsby

and 7 more!

Project Summary

WhisperX enhances OpenAI's Whisper ASR by providing word-level timestamps and speaker diarization, enabling highly accurate, time-stamped transcriptions for long-form audio. It targets researchers and developers needing precise speech analysis, offering up to 70x faster transcription speeds via a faster-whisper backend and accurate alignment with wav2vec2.

How It Works

WhisperX leverages a faster-whisper backend for batched inference, significantly boosting transcription speed. Word-level timestamps are achieved through forced alignment with wav2vec2 models. Speaker diarization is integrated using pyannote-audio, allowing for multi-speaker identification. Voice Activity Detection (VAD) preprocessing is employed to reduce hallucinations and improve batching efficiency without degrading Word Error Rate (WER).

Quick Start & Requirements

Installation: pip install whisperx or uvx whisperx
Prerequisites: CUDA (for GPU acceleration), ffmpeg, rust. Hugging Face access token required for speaker diarization.
Resources: Requires <8GB GPU memory for large-v2 with beam_size=5.
Docs: Setup, WhisperX

Highlighted Details

Achieves 70x realtime transcription speed with large-v2 and faster-whisper.
Integrates speaker diarization from pyannote-audio.
Provides accurate word-level timestamps via wav2vec2 alignment.
VAD preprocessing is enabled by default for improved accuracy and batching.

Maintenance & Community

Project accepted at INTERSPEECH 2023.
Contact: maxhbain@gmail.com
Supported by VGG and the University of Oxford.

Licensing & Compatibility

Primarily MIT license.
Compatible with commercial use.

Limitations & Caveats

Words with non-alphanumeric characters (e.g., "2014.", "£13.60") may not receive timestamps due to alignment model limitations. Overlapping speech is not optimally handled. A known issue exists with slow performance for pyannote/Speaker-Diarization-3.0 due to dependency conflicts.

Health Check

Last Commit

2 months ago

Responsiveness

1 week

Pull Requests (30d)

Issues (30d)

Star History

429 stars in the last 30 days