ASR optimization via multi-head decoding
This repository introduces Whisper-Medusa, an extension of the Whisper ASR model designed to accelerate inference by predicting multiple tokens per iteration. It targets researchers and developers working with large ASR models who need to improve transcription speed, offering two architectures: Medusa-Linear and Medusa-Block.
How It Works
Whisper-Medusa builds upon the Whisper architecture by adding multiple "Medusa heads" that predict subsequent tokens in parallel. Medusa-Linear uses a single linear layer per head, while Medusa-Block shares a full Whisper decoder block across heads. This multi-head approach allows for faster generation by outputting more tokens per forward pass, with a trade-off in accuracy that is generally minimal.
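The idea above can be sketched in a few lines: one decoder pass produces a hidden state, the base LM head predicts the next token, and each Medusa head predicts one further position in parallel. This is a conceptual toy, not the repository's code; the shapes, greedy decoding, and random weights are illustrative assumptions.

```python
# Conceptual sketch of Medusa-Linear heads: one forward pass yields
# 1 + n_heads greedy token candidates instead of a single token.
# Toy sizes and random weights are assumptions for illustration.
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab, n_heads = 8, 16, 4  # toy dimensions

lm_head = rng.standard_normal((d_model, vocab))                       # base head -> token t+1
medusa_heads = [rng.standard_normal((d_model, vocab)) for _ in range(n_heads)]  # tokens t+2..t+1+n_heads

def predict_tokens(hidden):
    """Map one decoder hidden state to 1 + n_heads greedy candidates."""
    tokens = [int(np.argmax(hidden @ lm_head))]
    for head in medusa_heads:
        tokens.append(int(np.argmax(hidden @ head)))
    return tokens

hidden = rng.standard_normal(d_model)  # last decoder hidden state
candidates = predict_tokens(hidden)
print(len(candidates))  # 5 candidate tokens from a single pass
```

In the full scheme the extra candidates are verified before being accepted, which is why the accuracy trade-off stays small; Medusa-Block differs only in routing the hidden state through a shared decoder block before the per-head projections.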
Quick Start & Requirements
After setting up a virtual environment and installing PyTorch with CUDA 11.8 support, install the package from the repository root:
pip install -e .
Load the model with WhisperMedusaModel.from_pretrained.
Highlighted Details
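The setup described above can be run as follows. This is a sketch, not the repository's documented script: the virtual-environment name is arbitrary, and the CUDA 11.8 wheel index URL follows PyTorch's standard pip convention.

```shell
# Sketch of the setup steps above; venv name and index URL are assumptions.
python -m venv venv
source venv/bin/activate
pip install torch --index-url https://download.pytorch.org/whl/cu118
pip install -e .   # run from the repository root
```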
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The model is trained on LibriSpeech, which may limit robustness to background noise. It is optimized for English audio at a 16 kHz sampling rate and currently supports audio files of up to 30 seconds.
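The input constraints above (16 kHz sampling rate, 30-second maximum) can be enforced with a small pre-flight check before transcription. The helper below is illustrative only; its name, signature, and error messages are assumptions, not part of the repository's API.

```python
# Illustrative pre-flight check for the stated input constraints
# (16 kHz audio, clips up to 30 s). Not the repository's API.
MAX_SECONDS = 30
REQUIRED_SR = 16_000

def validate_audio(num_samples: int, sample_rate: int) -> float:
    """Return the clip duration in seconds, or raise if unsupported."""
    if sample_rate != REQUIRED_SR:
        raise ValueError(f"expected {REQUIRED_SR} Hz audio, got {sample_rate} Hz")
    duration = num_samples / sample_rate
    if duration > MAX_SECONDS:
        raise ValueError(f"clip is {duration:.1f} s; maximum is {MAX_SECONDS} s")
    return duration

print(validate_audio(16_000 * 10, 16_000))  # 10.0 (a valid 10-second clip)
```

Audio at other sampling rates would need resampling to 16 kHz, and longer recordings would need chunking, before being passed to the model.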