Speech recognition for verbatim transcription with word-level timestamps
CrisperWhisper is an advanced variant of OpenAI's Whisper, designed for verbatim Automatic Speech Recognition (ASR) with highly accurate word-level timestamps and filler word detection. It targets researchers and developers needing precise temporal segmentation of speech, offering a significant improvement over standard Whisper for applications requiring exact transcription of spoken content, including disfluencies.
How It Works
CrisperWhisper enhances Whisper's capabilities by employing Dynamic Time Warping (DTW) on cross-attention scores, refined by a custom attention loss function. This approach, combined with a retokenization process, allows for precise alignment of tokens to words or pauses, leading to superior word-level timestamp accuracy, especially around disfluencies. The model is trained in stages, first adapting Whisper to a new tokenizer, then fine-tuning on verbatim datasets, and finally incorporating the attention loss to boost timestamp precision.
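The alignment step above can be sketched with a toy example. Below is a minimal pure-Python Dynamic Time Warping over a (token × audio-frame) cost matrix; in CrisperWhisper the costs come from cross-attention scores, but here a hand-made matrix stands in purely for illustration. This is a sketch of the general DTW technique, not the project's actual implementation.

```python
def dtw_path(cost):
    """Return the minimum-cost monotonic path through a 2D cost matrix
    as a list of (row, col) pairs from (0, 0) to (n-1, m-1)."""
    n, m = len(cost), len(cost[0])
    INF = float("inf")
    # acc[i][j] = cheapest cumulative cost to reach cell (i, j)
    acc = [[INF] * m for _ in range(n)]
    acc[0][0] = cost[0][0]
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            best = min(
                acc[i - 1][j] if i > 0 else INF,            # token advances
                acc[i][j - 1] if j > 0 else INF,            # frame advances
                acc[i - 1][j - 1] if i > 0 and j > 0 else INF,  # both advance
            )
            acc[i][j] = cost[i][j] + best
    # Backtrack from the end cell to recover the optimal path.
    path = [(n - 1, m - 1)]
    i, j = n - 1, m - 1
    while (i, j) != (0, 0):
        candidates = []
        if i > 0:
            candidates.append((acc[i - 1][j], (i - 1, j)))
        if j > 0:
            candidates.append((acc[i][j - 1], (i, j - 1)))
        if i > 0 and j > 0:
            candidates.append((acc[i - 1][j - 1], (i - 1, j - 1)))
        _, (i, j) = min(candidates)
        path.append((i, j))
    path.reverse()
    return path

# Rows = 3 tokens, columns = 5 audio frames. Cost is low where a token
# "attends" strongly to a frame; the DTW path assigns each token a span
# of frames, from which word-level start/end timestamps can be read off.
cost = [
    [0.1, 0.2, 0.9, 0.9, 0.9],
    [0.9, 0.2, 0.1, 0.2, 0.9],
    [0.9, 0.9, 0.9, 0.2, 0.1],
]
print(dtw_path(cost))
```

Because the path is constrained to move monotonically through both tokens and frames, every token (including fillers and pauses, thanks to the retokenization) receives a contiguous, non-overlapping time span.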
Quick Start & Requirements
pip install git+https://github.com/nyrahealth/transformers.git@crisper_whisper
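After installing the forked transformers build above, transcription can be sketched with the standard ASR pipeline. The Hugging Face model id "nyrahealth/CrisperWhisper", the audio filename, and the device setting are assumptions for illustration; adjust them to your checkpoint and hardware. Untested here, since it downloads model weights at runtime.

```python
import torch
from transformers import pipeline

# Assumed model id; replace with the checkpoint you actually use.
asr = pipeline(
    "automatic-speech-recognition",
    model="nyrahealth/CrisperWhisper",
    torch_dtype=torch.float16,
    device="cuda:0",   # use "cpu" if no GPU is available
    chunk_length_s=30,
)

# return_timestamps="word" requests word-level (start, end) times.
result = asr("audio.wav", return_timestamps="word")
print(result["text"])
print(result["chunks"])  # per-word segments with timestamps
```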
Limitations & Caveats
The model is licensed for non-commercial use only, restricting its application in commercial products. While compatible with both transformers and faster-whisper, the latter may have reduced timestamp accuracy due to implementation differences.