CrisperWhisper by nyrahealth

Speech recognition for verbatim transcription with word-level timestamps

Created 1 year ago
857 stars

Top 41.8% on SourcePulse

Project Summary

CrisperWhisper is an advanced variant of OpenAI's Whisper, designed for verbatim Automatic Speech Recognition (ASR) with highly accurate word-level timestamps and filler word detection. It targets researchers and developers needing precise temporal segmentation of speech, offering a significant improvement over standard Whisper for applications requiring exact transcription of spoken content, including disfluencies.

How It Works

CrisperWhisper enhances Whisper's capabilities by employing Dynamic Time Warping (DTW) on cross-attention scores, refined by a custom attention loss function. This approach, combined with a retokenization process, allows for precise alignment of tokens to words or pauses, leading to superior word-level timestamp accuracy, especially around disfluencies. The model is trained in stages, first adapting Whisper to a new tokenizer, then fine-tuning on verbatim datasets, and finally incorporating the attention loss to boost timestamp precision.
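The DTW step above can be illustrated in plain Python: given a cost matrix derived from cross-attention scores, DTW finds a monotonic path mapping each token to the audio frames it attends to, and the frame span covered by a token gives its timestamp. This is a toy sketch under simplified assumptions (a 3-token by 5-frame matrix, cost defined as one minus attention weight), not CrisperWhisper's actual implementation, which adds a custom attention loss and retokenization.

```python
def dtw_path(cost):
    """Dynamic Time Warping over a (tokens x frames) cost matrix.

    Returns a monotonic alignment path as (token_idx, frame_idx) pairs;
    the frame span covered by each token yields its word-level timestamp.
    """
    n, m = len(cost), len(cost[0])
    INF = float("inf")
    acc = [[INF] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i][j] = cost[i - 1][j - 1] + min(
                acc[i - 1][j - 1],  # advance token and frame
                acc[i - 1][j],      # advance token only
                acc[i][j - 1],      # advance frame only
            )
    # Backtrack from the bottom-right corner to recover the path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        moves = [acc[i - 1][j - 1], acc[i - 1][j], acc[i][j - 1]]
        step = moves.index(min(moves))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

# Toy cross-attention weights: 3 tokens over 5 audio frames.
attention = [
    [0.9, 0.8, 0.1, 0.0, 0.0],
    [0.0, 0.1, 0.9, 0.8, 0.1],
    [0.0, 0.0, 0.1, 0.8, 0.9],
]
# High attention = low alignment cost.
path = dtw_path([[1.0 - a for a in row] for row in attention])
```

With sharper, less dispersed attention (the goal of the attention loss), the recovered path hugs the true token-to-frame correspondence more tightly, which is why timestamps improve around pauses and disfluencies.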

Quick Start & Requirements

  • Install: pip install git+https://github.com/nyrahealth/transformers.git@crisper_whisper
  • Prerequisites: Python 3.10, PyTorch 2.0, NVIDIA libraries (cuBLAS 11.x, cuDNN 8.x), ffmpeg, Rust. A Hugging Face account and access token are required to download the model.
  • Setup: Clone repository, create Conda environment, install dependencies.
  • Docs: OpenAI Whisper Setup
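After installation, transcription with word timestamps typically goes through the Hugging Face ASR pipeline. A minimal sketch, assuming the model is published as `nyrahealth/CrisperWhisper` on the Hub and that word timestamps are requested via `return_timestamps="word"` (check the repository README for the exact invocation):

```python
def word_timestamps(result):
    """Flatten a Hugging Face ASR pipeline result (return_timestamps="word")
    into (word, start, end) tuples."""
    return [
        (chunk["text"].strip(), *chunk["timestamp"])
        for chunk in result["chunks"]
    ]


if __name__ == "__main__":
    # Heavy imports kept under the guard so the helper above is usable
    # without torch/transformers installed.
    import torch
    from transformers import pipeline

    device = "cuda:0" if torch.cuda.is_available() else "cpu"
    asr = pipeline(
        "automatic-speech-recognition",
        model="nyrahealth/CrisperWhisper",  # assumed Hub model id
        torch_dtype=torch.float16 if device != "cpu" else torch.float32,
        device=device,
        return_timestamps="word",
    )
    result = asr("audio.wav")  # path to your audio file
    for word, start, end in word_timestamps(result):
        print(f"{start:6.2f}-{end:6.2f}  {word}")
```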

Highlighted Details

  • Achieved 1st place on the OpenASR Leaderboard, both overall and on the verbatim datasets (TED, AMI).
  • Accepted at INTERSPEECH 2024.
  • Demonstrates significant Word Error Rate (WER) reduction compared to Whisper Large v3 on verbatim datasets (e.g., AMI: 8.72 vs 16.01).
  • Offers improved segmentation performance, with higher F1 scores and Avg. IOU on datasets like AMI IHM and Common Voice.

Maintenance & Community

  • The project is associated with nyrahealth.
  • Paper available for detailed methodology.

Licensing & Compatibility

  • Licensed under Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).
  • Non-commercial use only.

Limitations & Caveats

The model is licensed for non-commercial use only, which rules out use in commercial products. It works with both transformers and faster-whisper, but the faster-whisper backend may produce less accurate timestamps due to implementation differences.

Health Check

  • Last commit: 5 months ago
  • Responsiveness: 1 week
  • Pull requests (30d): 0
  • Issues (30d): 1
  • Star history: 29 stars in the last 30 days

Explore Similar Projects

Starred by Patrick von Platen (author of Hugging Face Diffusers; research engineer at Mistral), Jiaming Song (Chief Scientist at Luma AI), and 1 more.

delayed-streams-modeling by kyutai-labs

  • 0.6% · 3k stars
  • Streaming multimodal sequence-to-sequence learning
  • Created 4 months ago · Updated 1 month ago

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), Pietro Schirano (founder of MagicPath), and 2 more.

metavoice-src by metavoiceio

  • 0.1% · 4k stars
  • TTS model for human-like, expressive speech
  • Created 1 year ago · Updated 1 year ago