whisper-medusa by aiola-lab

ASR optimization via multi-head decoding

created 1 year ago
850 stars

Top 42.9% on sourcepulse

View on GitHub
1 Expert Loves This Project
Project Summary

This repository introduces Whisper-Medusa, an extension of the Whisper ASR model designed to accelerate inference by predicting multiple tokens per iteration. It targets researchers and developers working with large ASR models who need to improve transcription speed, offering two architectures: Medusa-Linear and Medusa-Block.

How It Works

Whisper-Medusa builds on the Whisper architecture by adding multiple "Medusa heads" that predict subsequent tokens in parallel. Medusa-Linear attaches a single linear layer per head, while Medusa-Block shares a full Whisper decoder block across the heads. Emitting more than one token per forward pass speeds up generation, at the cost of a generally small drop in accuracy.
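
To make the multi-head idea concrete, here is a minimal PyTorch sketch of the Medusa-Linear variant. It is illustrative only: the class name, dimensions, and wiring are assumptions, not the repository's actual code.

```python
import torch
import torch.nn as nn

class MedusaLinearHeads(nn.Module):
    """Illustrative sketch of Medusa-Linear heads: one linear projection per
    head on top of the decoder's last hidden state, where head k proposes the
    token k+1 positions beyond the one predicted by the base LM head."""

    def __init__(self, d_model: int, vocab_size: int, num_heads: int = 4):
        super().__init__()
        self.heads = nn.ModuleList(
            [nn.Linear(d_model, vocab_size, bias=False) for _ in range(num_heads)]
        )

    def forward(self, last_hidden: torch.Tensor) -> torch.Tensor:
        # last_hidden: (batch, seq_len, d_model) from the Whisper decoder
        # returns:     (num_heads, batch, seq_len, vocab_size)
        return torch.stack([head(last_hidden) for head in self.heads])


# Toy usage with made-up dimensions (not Whisper's real config):
heads = MedusaLinearHeads(d_model=1280, vocab_size=51865, num_heads=4)
hidden = torch.randn(1, 10, 1280)                  # stand-in decoder hidden states
logits = heads(hidden)                             # (4, 1, 10, 51865)
draft_tokens = logits[:, :, -1, :].argmax(dim=-1)  # one speculative token per head
```

Each head's draft tokens are then verified against the base model in the usual speculative-decoding fashion, which is where the speedup comes from.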

Quick Start & Requirements

  • Install: Clone the repository and install with pip install -e . after setting up a virtual environment and installing PyTorch with CUDA 11.8 support.
  • Prerequisites: Python 3.11, PyTorch 2.2.2, torchvision 0.17.2, torchaudio 2.2.2, CUDA 11.8.
  • Usage: Inference goes through WhisperMedusaModel.from_pretrained (see the sketch after this list).
  • Links: Blog, Paper
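
Based on the quick-start notes above, a minimal inference sketch might look like the following. The import path (whisper_medusa), the checkpoint id aiola/whisper-medusa-v1, and the processor usage are assumptions and should be checked against the repository's README.

```python
# Setup (per the steps above): create a virtual env, install PyTorch 2.2.2 with
# CUDA 11.8, then clone the repo and run `pip install -e .` inside it.

import torch
import torchaudio
from transformers import WhisperProcessor
from whisper_medusa import WhisperMedusaModel   # import path assumed, not verified

model_id = "aiola/whisper-medusa-v1"            # checkpoint name assumed; see the Hugging Face Hub
device = "cuda" if torch.cuda.is_available() else "cpu"

model = WhisperMedusaModel.from_pretrained(model_id).to(device)
processor = WhisperProcessor.from_pretrained(model_id)

# Load a short (<=30 s) English clip and resample to the expected 16 kHz.
waveform, sr = torchaudio.load("sample.wav")
waveform = torchaudio.functional.resample(waveform.mean(dim=0), sr, 16_000)

inputs = processor(waveform.numpy(), sampling_rate=16_000, return_tensors="pt")
predicted_ids = model.generate(inputs.input_features.to(device), language="en")
print(processor.batch_decode(predicted_ids, skip_special_tokens=True)[0])
```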

Highlighted Details

  • Achieves ~1.5x faster generation compared to vanilla Whisper with comparable Word Error Rate (WER).
  • Medusa-Linear offers higher speedup but with slightly degraded WER compared to Medusa-Block.
  • Medusa-Block's WER falls between that of vanilla Whisper and fine-tuned Whisper.
  • Supports training and evaluation pipelines with detailed configuration options.

Maintenance & Community

  • Based on research from aiola-lab.
  • Pretrained models available on Hugging Face Hub.

Licensing & Compatibility

  • The repository does not explicitly state a license in the README. However, it is based on Whisper, which is MIT licensed. Compatibility for commercial use should be verified.

Limitations & Caveats

The model is trained on LibriSpeech, which may limit robustness to background noise. It is optimized for English audio sampled at 16 kHz and currently supports audio files up to 30 seconds long.
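
For audio that does not already meet these constraints, a small preprocessing helper along the following lines could be used. This is a hypothetical sketch, not part of the repository.

```python
import torchaudio

TARGET_SR = 16_000   # expected sampling rate
MAX_SECONDS = 30     # current per-input limit noted above

def prepare_chunks(path: str):
    """Hypothetical helper: load audio, downmix to mono, resample to 16 kHz,
    and split into <=30-second chunks that each fit the model's input limit."""
    waveform, sr = torchaudio.load(path)                               # (channels, time)
    waveform = waveform.mean(dim=0, keepdim=True)                      # mono
    waveform = torchaudio.functional.resample(waveform, sr, TARGET_SR)
    step = TARGET_SR * MAX_SECONDS
    return [waveform[:, i:i + step] for i in range(0, waveform.shape[1], step)]

chunks = prepare_chunks("long_recording.wav")  # transcribe each chunk separately
```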

Health Check

  • Last commit: 3 weeks ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 4
  • Issues (30d): 0
  • Star History: 19 stars in the last 90 days

Explore Similar Projects

Starred by Omar Sanseviero (DevRel at Google DeepMind), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 3 more.

Medusa by FasterDecoding
Framework for accelerating LLM generation using multiple decoding heads
0.2% · 3k stars · created 1 year ago · updated 1 year ago

Starred by Boris Cherny (Creator of Claude Code; MTS at Anthropic), Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; Author of CS 231n), and 19 more.

whisper by openai
Speech recognition model for multilingual transcription/translation
0.4% · 86k stars · created 2 years ago · updated 1 month ago