CrisperWhisper by nyrahealth

Speech recognition for verbatim transcription with word-level timestamps

Created 1 year ago
857 stars

Top 41.8% on SourcePulse

Project Summary

CrisperWhisper is an advanced variant of OpenAI's Whisper, designed for verbatim Automatic Speech Recognition (ASR) with highly accurate word-level timestamps and filler word detection. It targets researchers and developers needing precise temporal segmentation of speech, offering a significant improvement over standard Whisper for applications requiring exact transcription of spoken content, including disfluencies.

How It Works

CrisperWhisper enhances Whisper's capabilities by employing Dynamic Time Warping (DTW) on cross-attention scores, refined by a custom attention loss function. This approach, combined with a retokenization process, allows for precise alignment of tokens to words or pauses, leading to superior word-level timestamp accuracy, especially around disfluencies. The model is trained in stages, first adapting Whisper to a new tokenizer, then fine-tuning on verbatim datasets, and finally incorporating the attention loss to boost timestamp precision.
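The DTW step above can be illustrated in plain Python: given a cost matrix derived from cross-attention scores, DTW finds a monotonic path mapping each token to the audio frames it attends to, and the frame span covered by a token gives its timestamp. This is a toy sketch under simplified assumptions (a 3-token by 5-frame matrix, cost defined as one minus attention weight), not CrisperWhisper's actual implementation, which adds a custom attention loss and retokenization.

```python
def dtw_path(cost):
    """Dynamic Time Warping over a (tokens x frames) cost matrix.

    Returns a monotonic alignment path as (token_idx, frame_idx) pairs;
    the frame span covered by each token yields its word-level timestamp.
    """
    n, m = len(cost), len(cost[0])
    INF = float("inf")
    acc = [[INF] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i][j] = cost[i - 1][j - 1] + min(
                acc[i - 1][j - 1],  # advance token and frame
                acc[i - 1][j],      # advance token only
                acc[i][j - 1],      # advance frame only
            )
    # Backtrack from the bottom-right corner to recover the path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        moves = [acc[i - 1][j - 1], acc[i - 1][j], acc[i][j - 1]]
        step = moves.index(min(moves))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

# Toy cross-attention weights: 3 tokens over 5 audio frames.
attention = [
    [0.9, 0.8, 0.1, 0.0, 0.0],
    [0.0, 0.1, 0.9, 0.8, 0.1],
    [0.0, 0.0, 0.1, 0.8, 0.9],
]
# High attention = low alignment cost.
path = dtw_path([[1.0 - a for a in row] for row in attention])
```

With sharper, less dispersed attention (the goal of the attention loss), the recovered path hugs the true token-to-frame correspondence more tightly, which is why timestamps improve around pauses and disfluencies.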

Quick Start & Requirements

  • Install: pip install git+https://github.com/nyrahealth/transformers.git@crisper_whisper
  • Prerequisites: Python 3.10, PyTorch 2.0, NVIDIA libraries (cuBLAS 11.x, cuDNN 8.x), ffmpeg, Rust. A Hugging Face account and access token are required to download the model.
  • Setup: Clone repository, create Conda environment, install dependencies.
  • Docs: OpenAI Whisper Setup
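After installation, transcription with word timestamps typically goes through the Hugging Face ASR pipeline. A minimal sketch, assuming the model is published as `nyrahealth/CrisperWhisper` on the Hub and that word timestamps are requested via `return_timestamps="word"` (check the repository README for the exact invocation):

```python
def word_timestamps(result):
    """Flatten a Hugging Face ASR pipeline result (return_timestamps="word")
    into (word, start, end) tuples."""
    return [
        (chunk["text"].strip(), *chunk["timestamp"])
        for chunk in result["chunks"]
    ]


if __name__ == "__main__":
    # Heavy imports kept under the guard so the helper above is usable
    # without torch/transformers installed.
    import torch
    from transformers import pipeline

    device = "cuda:0" if torch.cuda.is_available() else "cpu"
    asr = pipeline(
        "automatic-speech-recognition",
        model="nyrahealth/CrisperWhisper",  # assumed Hub model id
        torch_dtype=torch.float16 if device != "cpu" else torch.float32,
        device=device,
        return_timestamps="word",
    )
    result = asr("audio.wav")  # path to your audio file
    for word, start, end in word_timestamps(result):
        print(f"{start:6.2f}-{end:6.2f}  {word}")
```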

Highlighted Details

  • Achieved 1st place on the OpenASR Leaderboard, both overall and on the verbatim datasets (TED, AMI).
  • Accepted at INTERSPEECH 2024.
  • Demonstrates significant Word Error Rate (WER) reduction compared to Whisper Large v3 on verbatim datasets (e.g., AMI: 8.72 vs 16.01).
  • Offers improved segmentation performance, with higher F1 scores and Avg. IOU on datasets like AMI IHM and Common Voice.

Maintenance & Community

  • The project is associated with nyrahealth.
  • Paper available for detailed methodology.

Licensing & Compatibility

  • Licensed under Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).
  • Non-commercial use only.

Limitations & Caveats

The model is licensed for non-commercial use only, which rules out use in commercial products. It works with both transformers and faster-whisper, but the faster-whisper backend may produce less accurate timestamps due to implementation differences.

Health Check

  • Last commit: 5 months ago
  • Responsiveness: 1 week
  • Pull requests (30d): 0
  • Issues (30d): 1
  • Star history: 29 stars in the last 30 days

Explore Similar Projects

Starred by Patrick von Platen (author of Hugging Face Diffusers; research engineer at Mistral), Jiaming Song (Chief Scientist at Luma AI), and 1 more.

delayed-streams-modeling by kyutai-labs

  • 0.6% · 3k stars
  • Streaming multimodal sequence-to-sequence learning
  • Created 4 months ago · Updated 1 month ago

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), Pietro Schirano (founder of MagicPath), and 2 more.

metavoice-src by metavoiceio

  • 0.1% · 4k stars
  • TTS model for human-like, expressive speech
  • Created 1 year ago · Updated 1 year ago