distil-whisper by huggingface

Distilled speech recognition model, a faster Whisper variant

created 1 year ago
3,926 stars

Top 12.7% on sourcepulse

View on GitHub
Project Summary

Distil-Whisper offers a distilled English-only speech recognition model that is significantly faster and smaller than the original Whisper, while maintaining comparable accuracy. It is designed for users needing efficient speech-to-text capabilities, from researchers to developers integrating ASR into applications.

How It Works

Distil-Whisper employs knowledge distillation: it retains Whisper's full encoder but keeps only two decoder layers. This reduced architecture is trained to mimic Whisper's outputs on a large, diverse corpus of pseudo-labelled audio, minimizing a weighted sum of the KL divergence to the teacher's token distributions and the cross-entropy loss on the pseudo-labels. The result is a model that is 6x faster and 49% smaller with negligible impact on Word Error Rate (WER).
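
As a rough illustration of this objective (a minimal sketch, not the project's training code; the loss weights, temperature, and tensor shapes are assumptions):

    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, pseudo_labels,
                          kl_weight=0.8, ce_weight=1.0, temperature=1.0):
        """Weighted KL-to-teacher plus cross-entropy on pseudo-labels.
        Weights and temperature here are illustrative, not the paper's values."""
        vocab = student_logits.size(-1)
        # KL term: match the teacher's per-token output distribution
        kl = F.kl_div(
            F.log_softmax(student_logits / temperature, dim=-1),
            F.softmax(teacher_logits / temperature, dim=-1),
            reduction="batchmean",
        ) * temperature ** 2
        # CE term: predict the pseudo-label tokens generated by the teacher
        ce = F.cross_entropy(student_logits.view(-1, vocab), pseudo_labels.view(-1))
        return kl_weight * kl + ce_weight * ce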

Quick Start & Requirements

  • Install via pip: pip install --upgrade transformers accelerate datasets[audio] (see the usage sketch after this list)
  • Requires Python; a CUDA-enabled GPU is recommended for optimal performance.
  • An official Colab notebook is available for benchmarking.
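
A minimal short-form usage sketch via the Transformers pipeline; distil-whisper/distil-large-v2 is one of the published checkpoints, and the audio path is a placeholder:

    import torch
    from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

    device = "cuda:0" if torch.cuda.is_available() else "cpu"
    torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

    # distil-large-v2 is one published checkpoint; swap in another as needed
    model_id = "distil-whisper/distil-large-v2"
    model = AutoModelForSpeechSeq2Seq.from_pretrained(
        model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
    ).to(device)
    processor = AutoProcessor.from_pretrained(model_id)

    pipe = pipeline(
        "automatic-speech-recognition",
        model=model,
        tokenizer=processor.tokenizer,
        feature_extractor=processor.feature_extractor,
        max_new_tokens=128,
        torch_dtype=torch_dtype,
        device=device,
    )
    print(pipe("sample.mp3")["text"])  # "sample.mp3" is a placeholder path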

Highlighted Details

  • 6x faster inference, 49% smaller model size.
  • Within 1% WER of the original Whisper on out-of-distribution data.
  • Robust to noise and hallucinations, with fewer repeated-word errors and a lower insertion error rate.
  • Supports speculative decoding with the original Whisper model for a 2x speedup with guaranteed identical outputs (see the sketch after this list).
  • Compatible with the Hugging Face Transformers library (v4.35+).
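
A hedged sketch of speculative decoding, in which Distil-Whisper drafts tokens that the full Whisper model verifies, so the final output matches Whisper exactly. Model IDs are illustrative; the generate_kwargs hook is the standard Transformers assisted-generation path:

    import torch
    from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

    device = "cuda:0" if torch.cuda.is_available() else "cpu"
    torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

    # Main model: the original Whisper; its outputs are what you get back
    model = AutoModelForSpeechSeq2Seq.from_pretrained(
        "openai/whisper-large-v2", torch_dtype=torch_dtype, low_cpu_mem_usage=True
    ).to(device)
    processor = AutoProcessor.from_pretrained("openai/whisper-large-v2")

    # Assistant: Distil-Whisper drafts tokens that the main model verifies
    assistant_model = AutoModelForSpeechSeq2Seq.from_pretrained(
        "distil-whisper/distil-large-v2", torch_dtype=torch_dtype, low_cpu_mem_usage=True
    ).to(device)

    pipe = pipeline(
        "automatic-speech-recognition",
        model=model,
        tokenizer=processor.tokenizer,
        feature_extractor=processor.feature_extractor,
        max_new_tokens=128,
        generate_kwargs={"assistant_model": assistant_model},
        torch_dtype=torch_dtype,
        device=device,
    )
    print(pipe("sample.mp3")["text"])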

Maintenance & Community

  • Developed by Hugging Face.
  • Integrations with other libraries, such as whisper.cpp, Transformers.js, and Candle, are provided.
  • Training code is available for community use and adaptation to other languages.

Licensing & Compatibility

  • MIT License, permitting commercial use.
  • Compatible with various libraries for exporting and integration.

Limitations & Caveats

  • Currently supports English speech recognition only; multilingual support is available via Whisper Turbo.
  • Long-form transcription requires a dedicated algorithm (sequential or chunked) for optimal performance; a chunked-inference sketch follows.
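
A minimal sketch of the chunked algorithm via the Transformers pipeline; the chunk_length_s and batch_size values are illustrative defaults, not tuned recommendations:

    from transformers import pipeline

    # Chunked long-form: split audio into overlapping ~15 s windows and batch them
    pipe = pipeline(
        "automatic-speech-recognition",
        model="distil-whisper/distil-large-v2",
        chunk_length_s=15,
        batch_size=16,
    )
    print(pipe("long_audio.mp3")["text"])  # placeholder path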

Health Check

  • Last commit: 6 months ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 94 stars in the last 90 days

Explore Similar Projects

Starred by Boris Cherny (creator of Claude Code; MTS at Anthropic), Andrej Karpathy (founder of Eureka Labs; formerly at Tesla and OpenAI; author of CS 231n), and 19 more.

whisper by openai

Speech recognition model for multilingual transcription/translation

  • Top 0.4% · 86k stars
  • created 2 years ago; updated 1 month ago