CrisperWhisper by nyrahealth

Speech recognition for verbatim transcription with word-level timestamps

created 1 year ago
787 stars

Top 45.4% on sourcepulse

View on GitHub
Project Summary

CrisperWhisper is an advanced variant of OpenAI's Whisper, designed for verbatim Automatic Speech Recognition (ASR) with highly accurate word-level timestamps and filler word detection. It targets researchers and developers needing precise temporal segmentation of speech, offering a significant improvement over standard Whisper for applications requiring exact transcription of spoken content, including disfluencies.

How It Works

CrisperWhisper enhances Whisper's capabilities by employing Dynamic Time Warping (DTW) on cross-attention scores, refined by a custom attention loss function. This approach, combined with a retokenization process, allows for precise alignment of tokens to words or pauses, leading to superior word-level timestamp accuracy, especially around disfluencies. The model is trained in stages, first adapting Whisper to a new tokenizer, then fine-tuning on verbatim datasets, and finally incorporating the attention loss to boost timestamp precision.
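To make the alignment step concrete, the sketch below runs DTW over a token-by-frame cost matrix built from (negated) cross-attention scores. It illustrates the general idea only: the function name and the random stand-in matrix are illustrative, not the project's actual code.

    import numpy as np

    def dtw_path(cost):
        """Return the minimal-cost monotonic (token, frame) alignment path."""
        n_tokens, n_frames = cost.shape
        acc = np.full((n_tokens + 1, n_frames + 1), np.inf)
        acc[0, 0] = 0.0
        for i in range(1, n_tokens + 1):
            for j in range(1, n_frames + 1):
                acc[i, j] = cost[i - 1, j - 1] + min(
                    acc[i - 1, j],      # advance token, same frame
                    acc[i, j - 1],      # advance frame, same token
                    acc[i - 1, j - 1],  # advance both
                )
        i, j, path = n_tokens, n_frames, []
        while i > 0 and j > 0:          # backtrack from the bottom-right corner
            path.append((i - 1, j - 1))
            step = int(np.argmin([acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]]))
            if step == 0:
                i, j = i - 1, j - 1
            elif step == 1:
                i -= 1
            else:
                j -= 1
        return path[::-1]

    # Cross-attention averaged over heads gives a (tokens x frames) score matrix;
    # negating it turns "attend strongly" into "low cost" for DTW.
    attention = np.random.rand(5, 40)   # stand-in for real cross-attention
    path = dtw_path(-attention)
    # The first/last frame index assigned to each token, multiplied by the frame
    # duration (20 ms per Whisper encoder frame), yields word start/end times.

In the trained model, the custom attention loss sharpens these cross-attention patterns, which is what keeps the DTW path, and therefore the timestamps, reliable around pauses and disfluencies.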

Quick Start & Requirements

  • Install: pip install git+https://github.com/nyrahealth/transformers.git@crisper_whisper (a usage sketch follows this list)
  • Prerequisites: Python 3.10, PyTorch 2.0, NVIDIA libraries (cuBLAS 11.x, cuDNN 8.x), ffmpeg, Rust. A Hugging Face account and token are required for model download.
  • Setup: Clone repository, create Conda environment, install dependencies.
  • Docs: OpenAI Whisper Setup
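
Once the patched fork is installed, transcription can go through the standard transformers ASR pipeline. The sketch below is a minimal example under two assumptions: that the checkpoint is published on Hugging Face as nyrahealth/CrisperWhisper (adjust if yours differs) and that word-level timestamps are requested with return_timestamps="word".

    # Hedged sketch: the model id and audio path are placeholders, not verified
    # values from this project; substitute the checkpoint and file you use.
    import torch
    from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

    device = "cuda:0" if torch.cuda.is_available() else "cpu"
    dtype = torch.float16 if torch.cuda.is_available() else torch.float32

    model_id = "nyrahealth/CrisperWhisper"   # assumed Hugging Face checkpoint name
    model = AutoModelForSpeechSeq2Seq.from_pretrained(
        model_id, torch_dtype=dtype, low_cpu_mem_usage=True
    ).to(device)
    processor = AutoProcessor.from_pretrained(model_id)

    asr = pipeline(
        "automatic-speech-recognition",
        model=model,
        tokenizer=processor.tokenizer,
        feature_extractor=processor.feature_extractor,
        torch_dtype=dtype,
        device=device,
        return_timestamps="word",            # per-word start/end times
    )

    result = asr("audio.wav")                # any ffmpeg-readable file
    for chunk in result["chunks"]:
        start, end = chunk["timestamp"]      # times in seconds
        print(start, end, chunk["text"])

Each entry in result["chunks"] pairs one word (including fillers) with its start and end time, which is the output the DTW-based alignment above is designed to make precise.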

Highlighted Details

  • Achieved 1st place on the OpenASR Leaderboard for verbatim datasets (TED, AMI) and overall.
  • Accepted at INTERSPEECH 2024.
  • Demonstrates significant Word Error Rate (WER) reduction compared to Whisper Large v3 on verbatim datasets (e.g., AMI: 8.72 vs 16.01).
  • Offers improved segmentation performance, with higher F1 scores and average IoU on datasets such as AMI IHM and Common Voice.

Maintenance & Community

  • The project is maintained by nyrahealth.
  • An accompanying paper describes the methodology in detail.

Licensing & Compatibility

  • Licensed under Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).
  • Non-commercial use only.

Limitations & Caveats

The model is licensed for non-commercial use only, which rules out use in commercial products. It works with both transformers and faster-whisper, but the faster-whisper integration may yield less accurate timestamps due to implementation differences.

Health Check

  • Last commit: 2 months ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 3

Star History

  • 107 stars in the last 90 days

Explore Similar Projects

Starred by Boris Cherny (Creator of Claude Code; MTS at Anthropic), Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), and 19 more.

  • whisper by openai: Speech recognition model for multilingual transcription/translation. Top 0.4% on sourcepulse, 86k stars; created 2 years ago, updated 1 month ago.