ASR optimization via multi-head decoding
This repository introduces Whisper-Medusa, an extension of the Whisper ASR model designed to accelerate inference by predicting multiple tokens per iteration. It targets researchers and developers working with large ASR models who need to improve transcription speed, offering two architectures: Medusa-Linear and Medusa-Block.
How It Works
Whisper-Medusa builds upon the Whisper architecture by adding multiple "Medusa heads" that predict subsequent tokens in parallel. Medusa-Linear uses a single linear layer per head, while Medusa-Block shares a full Whisper decoder block across heads. This multi-head approach allows for faster generation by outputting more tokens per forward pass, with a trade-off in accuracy that is generally minimal.
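The idea above can be sketched in a few lines: one decoder pass produces a hidden state, the base LM head predicts the next token, and each Medusa head predicts one further position in parallel. This is a conceptual toy, not the repository's code; the shapes, greedy decoding, and random weights are illustrative assumptions.

```python
# Conceptual sketch of Medusa-Linear heads: one forward pass yields
# 1 + n_heads greedy token candidates instead of a single token.
# Toy sizes and random weights are assumptions for illustration.
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab, n_heads = 8, 16, 4  # toy dimensions

lm_head = rng.standard_normal((d_model, vocab))                       # base head -> token t+1
medusa_heads = [rng.standard_normal((d_model, vocab)) for _ in range(n_heads)]  # tokens t+2..t+1+n_heads

def predict_tokens(hidden):
    """Map one decoder hidden state to 1 + n_heads greedy candidates."""
    tokens = [int(np.argmax(hidden @ lm_head))]
    for head in medusa_heads:
        tokens.append(int(np.argmax(hidden @ head)))
    return tokens

hidden = rng.standard_normal(d_model)  # last decoder hidden state
candidates = predict_tokens(hidden)
print(len(candidates))  # 5 candidate tokens from a single pass
```

In the full scheme the extra candidates are verified before being accepted, which is why the accuracy trade-off stays small; Medusa-Block differs only in routing the hidden state through a shared decoder block before the per-head projections.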
Quick Start & Requirements
After setting up a virtual environment and installing PyTorch with CUDA 11.8 support, install the package from the repository root:
pip install -e .
Load the model with WhisperMedusaModel.from_pretrained.
Highlighted Details
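The setup described above can be run as follows. This is a sketch, not the repository's documented script: the virtual-environment name is arbitrary, and the CUDA 11.8 wheel index URL follows PyTorch's standard pip convention.

```shell
# Sketch of the setup steps above; venv name and index URL are assumptions.
python -m venv venv
source venv/bin/activate
pip install torch --index-url https://download.pytorch.org/whl/cu118
pip install -e .   # run from the repository root
```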
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The model is trained on LibriSpeech, which may limit robustness to background noise. It is optimized for English audio at a 16 kHz sampling rate and currently supports audio files of up to 30 seconds.
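The input constraints above (16 kHz sampling rate, 30-second maximum) can be enforced with a small pre-flight check before transcription. The helper below is illustrative only; its name, signature, and error messages are assumptions, not part of the repository's API.

```python
# Illustrative pre-flight check for the stated input constraints
# (16 kHz audio, clips up to 30 s). Not the repository's API.
MAX_SECONDS = 30
REQUIRED_SR = 16_000

def validate_audio(num_samples: int, sample_rate: int) -> float:
    """Return the clip duration in seconds, or raise if unsupported."""
    if sample_rate != REQUIRED_SR:
        raise ValueError(f"expected {REQUIRED_SR} Hz audio, got {sample_rate} Hz")
    duration = num_samples / sample_rate
    if duration > MAX_SECONDS:
        raise ValueError(f"clip is {duration:.1f} s; maximum is {MAX_SECONDS} s")
    return duration

print(validate_audio(16_000 * 10, 16_000))  # 10.0 (a valid 10-second clip)
```

Audio at other sampling rates would need resampling to 16 kHz, and longer recordings would need chunking, before being passed to the model.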