CrisperWhisper by nyrahealth

Speech recognition for verbatim transcription with word-level timestamps

Created 1 year ago
815 stars

Top 43.5% on SourcePulse

Project Summary

CrisperWhisper is an advanced variant of OpenAI's Whisper, designed for verbatim Automatic Speech Recognition (ASR) with highly accurate word-level timestamps and filler word detection. It targets researchers and developers needing precise temporal segmentation of speech, offering a significant improvement over standard Whisper for applications requiring exact transcription of spoken content, including disfluencies.

How It Works

CrisperWhisper enhances Whisper's capabilities by employing Dynamic Time Warping (DTW) on cross-attention scores, refined by a custom attention loss function. This approach, combined with a retokenization process, allows for precise alignment of tokens to words or pauses, leading to superior word-level timestamp accuracy, especially around disfluencies. The model is trained in stages, first adapting Whisper to a new tokenizer, then fine-tuning on verbatim datasets, and finally incorporating the attention loss to boost timestamp precision.
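The alignment step can be illustrated with a toy sketch: monotonic DTW through a token-by-frame cost matrix. In CrisperWhisper the costs come from cross-attention scores; the hand-made matrix below is purely illustrative, and this is not the project's actual code.

```python
import math

def dtw_path(cost):
    """Monotonic DTW through a (tokens x frames) cost matrix.

    Returns the (token, frame) pairs on the cheapest monotonic path,
    i.e. which audio frame each token is aligned to.
    """
    n, m = len(cost), len(cost[0])
    # Accumulated-cost table with a padded border of infinities.
    acc = [[math.inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i][j] = cost[i - 1][j - 1] + min(
                acc[i - 1][j - 1],  # advance token and frame
                acc[i - 1][j],      # advance token only
                acc[i][j - 1],      # advance frame only
            )
    # Backtrack from the bottom-right corner to recover the path.
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        steps = [(i - 1, j - 1), (i - 1, j), (i, j - 1)]
        i, j = min(steps, key=lambda s: acc[s[0]][s[1]])
        path.append((i - 1, j - 1))
    path.reverse()
    return path

# Toy cost matrix: low values on the diagonal, so the cheapest
# alignment follows it (token k aligns to frame k).
cost = [[0.1, 0.9, 0.9],
        [0.9, 0.1, 0.9],
        [0.9, 0.9, 0.1]]
print(dtw_path(cost))  # → [(0, 0), (1, 1), (2, 2)]
```

In the real model, a cross-attention matrix (refined by the custom attention loss) plays the role of `cost`, and the retokenization ensures each token maps cleanly to a word or pause, which is what yields accurate word boundaries around disfluencies.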

Quick Start & Requirements

  • Install: pip install git+https://github.com/nyrahealth/transformers.git@crisper_whisper
  • Prerequisites: Python 3.10, PyTorch 2.0, NVIDIA libraries (cuBLAS 11.x, cuDNN 8.x), ffmpeg, Rust. A Hugging Face account and access token are required for the model download.
  • Setup: Clone repository, create Conda environment, install dependencies.
  • Docs: OpenAI Whisper Setup
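After installation, transcription with word-level timestamps can be sketched via the Hugging Face pipeline API. The model id `nyrahealth/CrisperWhisper` and the audio filename are assumptions for illustration; check the repository's README for the exact id, and note the download may require `huggingface-cli login`.

```python
def format_chunks(chunks):
    """Render word-level chunks as 'start-end: word' lines.

    Each chunk is expected to look like the pipeline's output:
    {"timestamp": (start_seconds, end_seconds), "text": word}.
    """
    lines = []
    for c in chunks:
        start, end = c["timestamp"]
        lines.append(f"{start:.2f}-{end:.2f}: {c['text']}")
    return "\n".join(lines)

if __name__ == "__main__":
    # Heavy part kept under __main__: builds the ASR pipeline and
    # downloads the model weights on first run.
    from transformers import pipeline

    asr = pipeline(
        "automatic-speech-recognition",
        model="nyrahealth/CrisperWhisper",  # assumed Hugging Face model id
        return_timestamps="word",           # request word-level timestamps
        device_map="auto",
    )
    out = asr("audio.wav")  # any ffmpeg-readable audio file (assumed path)
    print(out["text"])
    print(format_chunks(out["chunks"]))
```

Because the model transcribes verbatim, the printed chunks include fillers and disfluencies ("uh", repetitions) with their own timestamps rather than dropping them.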

Highlighted Details

  • Achieved 1st place on the OpenASR Leaderboard for verbatim datasets (TED, AMI) and overall.
  • Accepted at INTERSPEECH 2024.
  • Demonstrates significant Word Error Rate (WER) reduction compared to Whisper Large v3 on verbatim datasets (e.g., AMI: 8.72 vs 16.01).
  • Offers improved segmentation performance, with higher F1 scores and Avg. IOU on datasets like AMI IHM and Common Voice.

Maintenance & Community

  • The project is associated with nyrahealth.
  • Paper available for detailed methodology.

Licensing & Compatibility

  • Licensed under Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).
  • Non-commercial use only.

Limitations & Caveats

The model is licensed for non-commercial use only, which rules out use in commercial products. It works with both transformers and faster-whisper, but the faster-whisper backend may produce less accurate timestamps due to implementation differences.

Health Check

  • Last Commit: 3 months ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 2
  • Star History: 16 stars in the last 30 days
