distil-whisper by huggingface

Distilled speech recognition model, a faster Whisper variant

Created 1 year ago
3,947 stars

Top 12.4% on SourcePulse

View on GitHub
Project Summary

Distil-Whisper offers a distilled English-only speech recognition model that is significantly faster and smaller than the original Whisper, while maintaining comparable accuracy. It is designed for users needing efficient speech-to-text capabilities, from researchers to developers integrating ASR into applications.

How It Works

Distil-Whisper employs knowledge distillation: it retains Whisper's full encoder but keeps only two decoder layers. This reduced architecture is trained to mimic Whisper's output on a large, diverse corpus of pseudo-labeled audio, minimizing the KL divergence between the student's and teacher's predicted token distributions together with the cross-entropy loss on the pseudo-labels. The result is a model that is 6x faster and 49% smaller with a negligible impact on word error rate.
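As a toy illustration of that objective, the per-token loss can be sketched as a weighted sum of the cross-entropy against the pseudo-label and the KL divergence from the teacher's distribution. The weights and distributions below are illustrative, not the paper's exact configuration:

```python
import math

def distil_loss(student_probs, teacher_probs, label, alpha_ce=1.0, alpha_kl=1.0):
    """Per-token distillation objective (illustrative weighting):
    cross-entropy against the pseudo-label plus KL(teacher || student)."""
    # Cross-entropy: negative log-probability the student assigns to the label.
    ce = -math.log(student_probs[label])
    # KL divergence from the teacher's distribution to the student's.
    kl = sum(t * math.log(t / s)
             for t, s in zip(teacher_probs, student_probs) if t > 0)
    return alpha_ce * ce + alpha_kl * kl
```

When the student matches the teacher exactly, the KL term vanishes and only the cross-entropy term remains; any mismatch adds a non-negative penalty on top.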

Quick Start & Requirements

  • Install via pip: pip install --upgrade transformers accelerate datasets[audio]
  • Requires Python; a CUDA-enabled GPU is recommended for optimal performance.
  • An official Colab notebook is available for benchmarking.
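A minimal usage sketch with the Transformers pipeline API. The checkpoint ID and audio file path below are assumptions (pick the checkpoint you actually want), and running this downloads the model weights:

```python
import torch
from transformers import pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
asr = pipeline(
    "automatic-speech-recognition",
    model="distil-whisper/distil-large-v2",  # assumed checkpoint ID
    torch_dtype=torch.float16 if device != "cpu" else torch.float32,
    device=device,
)

# Placeholder path; any audio file readable by ffmpeg works.
result = asr("sample.wav")
print(result["text"])
```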

Highlighted Details

  • 6x faster inference, 50% smaller model size.
  • Within 1% Word Error Rate (WER) of original Whisper on out-of-distribution data.
  • Robust to noise and less prone to hallucination, with fewer repeated-word duplications and lower insertion error rates.
  • Can act as the draft model in speculative decoding with the original Whisper, yielding roughly 2x speedup while guaranteeing outputs identical to running Whisper alone.
  • Compatible with Hugging Face Transformers library (v4.35+).
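The identical-output guarantee of speculative decoding can be illustrated with a toy greedy draft-and-verify loop. The two callables below are pure-Python stand-ins for the draft and target models (the real implementation batches the verification into a single forward pass of the target model):

```python
def speculative_decode(target_next, draft_next, prompt, n_tokens, k=4):
    """Greedy speculative decoding sketch: the draft proposes k tokens,
    the target verifies them; the output is identical to decoding with
    the target model alone."""
    out = list(prompt)
    while len(out) - len(prompt) < n_tokens:
        # Draft model proposes up to k tokens ahead.
        draft = list(out)
        for _ in range(k):
            draft.append(draft_next(draft))
        # Target verifies each proposal in order.
        for i in range(len(out), len(draft)):
            t = target_next(draft[:i])
            out.append(t)          # always keep the target's token
            if t != draft[i]:      # first mismatch: discard the rest
                break
        out = out[:len(prompt) + n_tokens]
    return out
```

Every emitted token is the target model's own greedy choice for its prefix, so acceptance of draft proposals only changes speed, never the transcript.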

Maintenance & Community

  • Developed by Hugging Face.
  • Integrations with other libraries like whisper.cpp, Transformers.js, and Candle are provided.
  • Training code is available for community use and adaptation to other languages.

Licensing & Compatibility

  • MIT License, permitting commercial use.
  • Compatible with various libraries for exporting and integration.

Limitations & Caveats

  • Currently supports English speech recognition only; multilingual support is available via Whisper Turbo.
  • Long-form transcription requires specific algorithms (sequential or chunked) for optimal performance.
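The chunked long-form algorithm mentioned above amounts to slicing the recording into fixed-length windows that overlap at the seams, transcribing each independently, and merging the boundary text. A minimal sketch of the window arithmetic (the window and stride lengths are illustrative defaults, not the library's):

```python
def chunk_bounds(total_s, chunk_s=30.0, stride_s=5.0):
    """Return (start, end) times in seconds for overlapping windows.
    Consecutive windows overlap by 2 * stride_s so words cut at a
    boundary appear in full in at least one window."""
    step = chunk_s - 2 * stride_s
    starts = []
    t = 0.0
    while True:
        starts.append(t)
        if t + chunk_s >= total_s:
            break
        t += step
    return [(s, min(s + chunk_s, total_s)) for s in starts]
```

For a 70-second recording with the defaults this yields windows (0, 30), (20, 50), (40, 70); the 10-second overlaps are where the merge step reconciles duplicate words.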
Health Check

  • Last Commit: 8 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 20 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), Pietro Schirano (founder of MagicPath), and 2 more.

metavoice-src by metavoiceio

  • 0.1% · 4k stars
  • TTS model for human-like, expressive speech
  • Created 1 year ago; updated 1 year ago
  • Starred by Patrick von Platen (author of Hugging Face Diffusers; Research Engineer at Mistral), Benjamin Bolte (cofounder of K-Scale Labs), and 3 more.

espnet by espnet

  • 0.2% · 9k stars
  • End-to-end speech processing toolkit for various speech tasks
  • Created 7 years ago; updated 3 days ago