whisper-medusa by aiola-lab

ASR optimization via multi-head decoding

Created 1 year ago
855 stars

Top 41.9% on SourcePulse

View on GitHub
1 Expert Loves This Project
Project Summary

This repository introduces Whisper-Medusa, an extension of the Whisper ASR model designed to accelerate inference by predicting multiple tokens per iteration. It targets researchers and developers working with large ASR models who need to improve transcription speed, offering two architectures: Medusa-Linear and Medusa-Block.

How It Works

Whisper-Medusa builds upon the Whisper architecture by adding multiple "Medusa heads" that predict subsequent tokens in parallel. Medusa-Linear uses a single linear layer per head, while Medusa-Block shares a full Whisper decoder block across heads. This multi-head approach allows for faster generation by outputting more tokens per forward pass, with a trade-off in accuracy that is generally minimal.
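
To make the multi-head idea concrete, below is a minimal PyTorch sketch of the Medusa-Linear variant: one extra linear projection per head on top of the decoder's hidden states, so a single forward pass produces logits for several future token positions. The class name, head count, and shapes are illustrative assumptions, not the repository's actual implementation; Medusa-Block would additionally pass the hidden states through a shared Whisper decoder block before the per-head projections.

    # Illustrative sketch only; names, head count, and shapes are assumptions.
    import torch
    import torch.nn as nn

    class MedusaLinearHeads(nn.Module):
        """One base LM head plus K extra linear 'Medusa' heads, each predicting
        one additional future token from the same decoder hidden state."""

        def __init__(self, hidden_size: int, vocab_size: int, num_heads: int = 5):
            super().__init__()
            self.base_head = nn.Linear(hidden_size, vocab_size, bias=False)  # predicts token t+1
            self.medusa_heads = nn.ModuleList(
                [nn.Linear(hidden_size, vocab_size, bias=False) for _ in range(num_heads)]
            )  # head k predicts token t+2+k

        def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
            # hidden_states: (batch, seq_len, hidden_size) from the Whisper decoder
            logits = [self.base_head(hidden_states)]
            logits += [head(hidden_states) for head in self.medusa_heads]
            # Shape (num_heads + 1, batch, seq_len, vocab_size): one forward pass
            # yields candidate logits for several future positions at once.
            return torch.stack(logits, dim=0)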

Quick Start & Requirements

  • Install: Clone the repository and install with pip install -e . after setting up a virtual environment and installing PyTorch with CUDA 11.8 support.
  • Prerequisites: Python 3.11, PyTorch 2.2.2, torchvision 0.17.2, torchaudio 2.2.2, CUDA 11.8.
  • Usage: An inference example using WhisperMedusaModel.from_pretrained is provided; see the sketch after this list.
  • Links: Blog, Paper
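
A minimal inference sketch follows. Only WhisperMedusaModel.from_pretrained comes from the summary above; the import path, checkpoint id (aiola/whisper-medusa-v1), audio filename, processor usage, and generate arguments are assumptions and may differ from the repository's own example.

    # Hedged inference sketch; see assumptions in the paragraph above.
    import torch
    import torchaudio
    from transformers import WhisperProcessor
    from whisper_medusa import WhisperMedusaModel  # assumed module path

    model_id = "aiola/whisper-medusa-v1"           # assumed checkpoint name
    device = "cuda" if torch.cuda.is_available() else "cpu"

    processor = WhisperProcessor.from_pretrained(model_id)
    model = WhisperMedusaModel.from_pretrained(model_id).to(device)

    # Load a clip, resample to 16 kHz, and force mono (see Limitations & Caveats).
    waveform, sr = torchaudio.load("sample.wav")   # placeholder file name
    if sr != 16_000:
        waveform = torchaudio.functional.resample(waveform, sr, 16_000)
    audio = waveform.mean(dim=0).numpy()

    inputs = processor(audio, sampling_rate=16_000, return_tensors="pt")
    with torch.no_grad():
        predicted_ids = model.generate(inputs.input_features.to(device))  # generate kwargs assumed
    print(processor.batch_decode(predicted_ids, skip_special_tokens=True)[0])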

Highlighted Details

  • Achieves ~1.5x faster generation compared to vanilla Whisper with comparable Word Error Rate (WER); a rough benchmarking sketch follows this list.
  • Medusa-Linear offers higher speedup but with slightly degraded WER compared to Medusa-Block.
  • Medusa-Block's WER is between vanilla Whisper and fine-tuned Whisper.
  • Supports training and evaluation pipelines with detailed configuration options.
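
One rough, local way to sanity-check the ~1.5x figure is to time generate for a vanilla Whisper baseline against the Medusa model on the same input features. This is a sketch, not the repository's benchmark: medusa_model and input_features are assumed to come from the inference sketch above, and whisper-large-v2 as the vanilla baseline checkpoint is also an assumption.

    # Rough timing comparison; baseline checkpoint and inputs are assumptions.
    import time
    import torch
    from transformers import WhisperForConditionalGeneration

    def avg_generate_seconds(model, input_features, n_runs=5):
        # Average wall-clock time of `generate` over a few runs.
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(n_runs):
            with torch.no_grad():
                model.generate(input_features)
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        return (time.perf_counter() - start) / n_runs

    def compare_speed(medusa_model, input_features, device="cuda"):
        vanilla = WhisperForConditionalGeneration.from_pretrained(
            "openai/whisper-large-v2"  # assumed base model
        ).to(device)
        t_vanilla = avg_generate_seconds(vanilla, input_features.to(device))
        t_medusa = avg_generate_seconds(medusa_model, input_features.to(device))
        print(f"vanilla: {t_vanilla:.2f}s  medusa: {t_medusa:.2f}s  "
              f"speedup: {t_vanilla / t_medusa:.2f}x")

    # Usage (continuing from the inference sketch above):
    # compare_speed(model, inputs.input_features)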

Maintenance & Community

  • Based on research from aiola-lab.
  • Pretrained models available on Hugging Face Hub.

Licensing & Compatibility

  • The repository does not explicitly state a license in the README. However, it is based on Whisper, which is MIT licensed. Compatibility for commercial use should be verified.

Limitations & Caveats

The model is trained on LibriSpeech, which may limit robustness to background noise. It is optimized for English audio sampled at 16 kHz and currently supports audio files up to 30 seconds long; a simple chunking workaround for longer recordings is sketched below.
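
For recordings longer than 30 seconds, one workaround is to resample to 16 kHz, split the audio into 30-second chunks, and transcribe each chunk independently (this simple approach ignores words cut at chunk boundaries). The helper below is a hypothetical sketch, not part of the repository; it reuses the model, processor, and device from the inference sketch above.

    # Hypothetical chunked-transcription helper; not part of the repository.
    import torch
    import torchaudio

    def transcribe_long(path, model, processor, device, chunk_seconds=30, sr=16_000):
        # Load, resample to 16 kHz, and force mono.
        waveform, orig_sr = torchaudio.load(path)
        if orig_sr != sr:
            waveform = torchaudio.functional.resample(waveform, orig_sr, sr)
        audio = waveform.mean(dim=0)
        chunk_len = chunk_seconds * sr
        texts = []
        for start in range(0, audio.numel(), chunk_len):
            chunk = audio[start:start + chunk_len].numpy()
            feats = processor(chunk, sampling_rate=sr, return_tensors="pt").input_features
            with torch.no_grad():
                ids = model.generate(feats.to(device))  # generate kwargs assumed
            texts.append(processor.batch_decode(ids, skip_special_tokens=True)[0])
        return " ".join(texts)

    # Usage: transcribe_long("long_audio.wav", model, processor, device)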

Health Check

  • Last Commit: 1 month ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 3 stars in the last 30 days

Starred by Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), Stas Bekman (Author of "Machine Learning Engineering Open Book"; Research Engineer at Snowflake), and 3 more.

Explore Similar Projects

prompt-lookup-decoding by apoorvumang

0.2%
566 stars
Decoding method for faster LLM generation
Created 1 year ago
Updated 1 year ago
Starred by Omar Sanseviero (DevRel at Google DeepMind), Patrick von Platen (Author of Hugging Face Diffusers; Research Engineer at Mistral), and 2 more.

pyctcdecode by kensho-technologies

0%
460 stars
CTC beam search decoder for speech recognition
Created 4 years ago
Updated 2 years ago