ASR tool for word-level timestamps and confidence scores using Whisper
Top 18.9% on sourcepulse
This project provides word-level timestamps and confidence scores for multilingual Automatic Speech Recognition (ASR) using OpenAI's Whisper models. It addresses the limitation of Whisper's segment-level timestamps, offering a more granular and accurate transcription for researchers and developers working with speech data.
How It Works
The core innovation lies in using Dynamic Time Warping (DTW) on Whisper's cross-attention weights to derive word-level alignments. This approach avoids the need for language-specific models or character normalization required by other methods, and it performs alignment on-the-fly without additional inference steps, optimizing memory usage for long audio files.
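To make the idea concrete, here is a minimal, self-contained DTW sketch in the spirit of that alignment (not the project's actual code): it treats a toy matrix of hypothetical cross-attention weights as a token-vs-frame cost surface and recovers a monotonic alignment path; mapping token indices back to words and frame indices back to times is what yields word-level timestamps.

```python
# Illustrative only: a toy DTW over a cost matrix derived from hypothetical
# cross-attention weights (tokens x audio frames). The real project aligns
# Whisper's actual attention tensors; all names and data here are made up.
import numpy as np

def dtw_path(cost):
    """Return the minimum-cost monotonic path through a (tokens x frames) cost matrix."""
    n, m = cost.shape
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j - 1],  # advance token and frame
                acc[i - 1, j],      # advance token only
                acc[i, j - 1],      # advance frame only
            )
    # Backtrack from the end to recover which frames each token is aligned to.
    i, j, path = n, m, []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

# Hypothetical attention weights: higher weight = stronger token/frame association.
rng = np.random.default_rng(0)
attention = rng.random((5, 20))            # 5 decoded tokens, 20 audio frames
cost = 1.0 - attention / attention.max()   # turn similarity into a cost
alignment = dtw_path(cost)                 # list of (token_index, frame_index) pairs
print(alignment)
```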
Quick Start & Requirements
Install with: pip3 install whisper-timestamped

Optional dependencies: matplotlib, torchaudio, onnxruntime (for VAD), and transformers (for Hugging Face models).
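A minimal usage sketch follows, assuming the package's Python API of load_audio, load_model, and transcribe, and a result dictionary whose segments carry per-word entries; "audio.wav" and the "tiny" model size are placeholders, and the exact result schema may vary between versions.

```python
import whisper_timestamped as whisper

# Load the audio and a small Whisper model ("audio.wav" and "tiny" are placeholders).
audio = whisper.load_audio("audio.wav")
model = whisper.load_model("tiny", device="cpu")

# Transcribe; the result is a dict with segment- and word-level details.
result = whisper.transcribe(model, audio, language="en")

# Print each word with its start/end timestamps and confidence score.
for segment in result["segments"]:
    for word in segment.get("words", []):
        print(f'{word["text"]}\t{word["start"]:.2f}s-{word["end"]:.2f}s\tconf={word["confidence"]:.2f}')
```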
Highlighted Details
Maintenance & Community
Licensing & Compatibility

Built on openai-whisper (MIT) and dtw-python (GPL v3). The GPL v3 license of dtw-python may impose copyleft restrictions on derivative works.

Limitations & Caveats