distil-whisper by huggingface

Distilled speech recognition model, a faster Whisper variant

Created 1 year ago
3,947 stars

Top 12.4% on SourcePulse

View on GitHub
Project Summary

Distil-Whisper offers a distilled English-only speech recognition model that is significantly faster and smaller than the original Whisper, while maintaining comparable accuracy. It is designed for users needing efficient speech-to-text capabilities, from researchers to developers integrating ASR into applications.

How It Works

Distil-Whisper employs knowledge distillation: it retains Whisper's full encoder but keeps only two decoder layers. This reduced architecture is trained to mimic Whisper's output on a large, diverse corpus of pseudo-labeled audio, minimizing the KL divergence between the student's and teacher's predicted token distributions together with the cross-entropy loss on the pseudo-labels. The result is a model that is 6x faster and 49% smaller with a negligible impact on word error rate.
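As a toy illustration of that objective, the per-token loss can be sketched as a weighted sum of the cross-entropy against the pseudo-label and the KL divergence from the teacher's distribution. The weights and distributions below are illustrative, not the paper's exact configuration:

```python
import math

def distil_loss(student_probs, teacher_probs, label, alpha_ce=1.0, alpha_kl=1.0):
    """Per-token distillation objective (illustrative weighting):
    cross-entropy against the pseudo-label plus KL(teacher || student)."""
    # Cross-entropy: negative log-probability the student assigns to the label.
    ce = -math.log(student_probs[label])
    # KL divergence from the teacher's distribution to the student's.
    kl = sum(t * math.log(t / s)
             for t, s in zip(teacher_probs, student_probs) if t > 0)
    return alpha_ce * ce + alpha_kl * kl
```

When the student matches the teacher exactly, the KL term vanishes and only the cross-entropy term remains; any mismatch adds a non-negative penalty on top.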

Quick Start & Requirements

  • Install via pip: pip install --upgrade transformers accelerate datasets[audio]
  • Requires Python; a CUDA-enabled GPU is recommended for optimal performance.
  • An official Colab notebook is available for benchmarking.
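A minimal usage sketch with the Transformers pipeline API. The checkpoint ID and audio file path below are assumptions (pick the checkpoint you actually want), and running this downloads the model weights:

```python
import torch
from transformers import pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
asr = pipeline(
    "automatic-speech-recognition",
    model="distil-whisper/distil-large-v2",  # assumed checkpoint ID
    torch_dtype=torch.float16 if device != "cpu" else torch.float32,
    device=device,
)

# Placeholder path; any audio file readable by ffmpeg works.
result = asr("sample.wav")
print(result["text"])
```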

Highlighted Details

  • 6x faster inference, 50% smaller model size.
  • Within 1% Word Error Rate (WER) of original Whisper on out-of-distribution data.
  • Robust to noise and less prone to hallucination, with fewer repeated-word duplications and lower insertion error rates.
  • Can act as the draft model in speculative decoding with the original Whisper, yielding roughly 2x speedup while guaranteeing outputs identical to running Whisper alone.
  • Compatible with Hugging Face Transformers library (v4.35+).
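The identical-output guarantee of speculative decoding can be illustrated with a toy greedy draft-and-verify loop. The two callables below are pure-Python stand-ins for the draft and target models (the real implementation batches the verification into a single forward pass of the target model):

```python
def speculative_decode(target_next, draft_next, prompt, n_tokens, k=4):
    """Greedy speculative decoding sketch: the draft proposes k tokens,
    the target verifies them; the output is identical to decoding with
    the target model alone."""
    out = list(prompt)
    while len(out) - len(prompt) < n_tokens:
        # Draft model proposes up to k tokens ahead.
        draft = list(out)
        for _ in range(k):
            draft.append(draft_next(draft))
        # Target verifies each proposal in order.
        for i in range(len(out), len(draft)):
            t = target_next(draft[:i])
            out.append(t)          # always keep the target's token
            if t != draft[i]:      # first mismatch: discard the rest
                break
        out = out[:len(prompt) + n_tokens]
    return out
```

Every emitted token is the target model's own greedy choice for its prefix, so acceptance of draft proposals only changes speed, never the transcript.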

Maintenance & Community

  • Developed by Hugging Face.
  • Integrations with other libraries like whisper.cpp, Transformers.js, and Candle are provided.
  • Training code is available for community use and adaptation to other languages.

Licensing & Compatibility

  • MIT License, permitting commercial use.
  • Compatible with various libraries for exporting and integration.

Limitations & Caveats

  • Currently supports English speech recognition only; multilingual support is available via Whisper Turbo.
  • Long-form transcription requires specific algorithms (sequential or chunked) for optimal performance.
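The chunked long-form algorithm mentioned above amounts to slicing the recording into fixed-length windows that overlap at the seams, transcribing each independently, and merging the boundary text. A minimal sketch of the window arithmetic (the window and stride lengths are illustrative defaults, not the library's):

```python
def chunk_bounds(total_s, chunk_s=30.0, stride_s=5.0):
    """Return (start, end) times in seconds for overlapping windows.
    Consecutive windows overlap by 2 * stride_s so words cut at a
    boundary appear in full in at least one window."""
    step = chunk_s - 2 * stride_s
    starts = []
    t = 0.0
    while True:
        starts.append(t)
        if t + chunk_s >= total_s:
            break
        t += step
    return [(s, min(s + chunk_s, total_s)) for s in starts]
```

For a 70-second recording with the defaults this yields windows (0, 30), (20, 50), (40, 70); the 10-second overlaps are where the merge step reconciles duplicate words.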
Health Check

  • Last Commit: 8 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 20 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), Pietro Schirano (founder of MagicPath), and 2 more.

metavoice-src by metavoiceio

  • 0.1% · 4k stars
  • TTS model for human-like, expressive speech
  • Created 1 year ago; updated 1 year ago
  • Starred by Patrick von Platen (author of Hugging Face Diffusers; Research Engineer at Mistral), Benjamin Bolte (cofounder of K-Scale Labs), and 3 more.

espnet by espnet

  • 0.2% · 9k stars
  • End-to-end speech processing toolkit for various speech tasks
  • Created 7 years ago; updated 3 days ago