whisper by openai

Speech recognition model for multilingual transcription/translation

Created 3 years ago
88,303 stars

Top 0.1% on SourcePulse

Project Summary

Whisper is a robust, general-purpose speech recognition model developed by OpenAI. It excels at multilingual transcription, speech translation, and language identification, serving researchers and developers needing high-accuracy audio processing. Its key benefit is a single, unified model that replaces complex, multi-stage traditional pipelines.

How It Works

Whisper uses a Transformer sequence-to-sequence (encoder-decoder) architecture trained on a large, diverse dataset of roughly 680,000 hours of labeled audio. It unifies several speech tasks (recognition, translation, language identification, voice activity detection) by representing each one as a sequence of tokens for the decoder to predict. Special tokens act as task specifiers or classification targets, so a single model learns all of the tasks jointly.
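To make the task-specifier idea concrete, the sketch below assembles the sequence of special tokens that would prefix the decoder's output for a given language and task. The token strings follow the format described in the Whisper paper; the function name `build_decoder_prompt` and its parameters are hypothetical and are not part of the whisper package, which handles this internally through its tokenizer.

```python
from typing import List, Optional

START = "<|startoftranscript|>"


def build_decoder_prompt(
    language: Optional[str],
    task: str = "transcribe",
    timestamps: bool = False,
) -> List[str]:
    """Illustrative only: return the special-token prefix as strings.

    `language` is a short code such as "en" or "fr". Passing None models
    language identification: the prompt stops after the start token, and
    the first token the decoder predicts is the language tag.
    """
    if task not in ("transcribe", "translate"):
        raise ValueError(f"unknown task: {task!r}")

    prompt = [START]
    if language is None:
        return prompt

    prompt.append(f"<|{language}|>")   # language tag
    prompt.append(f"<|{task}|>")       # task specifier
    if not timestamps:
        prompt.append("<|notimestamps|>")  # ask for text without timestamp tokens
    return prompt


if __name__ == "__main__":
    print(build_decoder_prompt("en"))
    print(build_decoder_prompt("fr", task="translate", timestamps=True))
```

Swapping `<|transcribe|>` for `<|translate|>` is the only change needed to switch tasks in this sketch, which is the sense in which one model covers several tasks.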

Quick Start & Requirements

Highlighted Details

  • Offers six model sizes (tiny, base, small, medium, large, and turbo) with varying VRAM and speed trade-offs.
  • Supports English-only and multilingual transcription, with specific .en models for improved English performance.
  • Includes a turbo model optimized for speed with minimal accuracy loss.
  • Benchmarks (WER/CER) available for multiple languages and datasets in the paper.
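As a rough illustration of how the model sizes and .en variants map onto the Python package, here is a minimal sketch. It assumes the `openai-whisper` package and `ffmpeg` are installed; `pick_model_name` and `transcribe_file` are hypothetical helpers written for this example, while `whisper.load_model` and `model.transcribe` are the calls shown in the project's README. English-only checkpoints exist for tiny, base, small, and medium, so the helper falls back to the multilingual name for large and turbo.

```python
ENGLISH_ONLY_SIZES = {"tiny", "base", "small", "medium"}
ALL_SIZES = ENGLISH_ONLY_SIZES | {"large", "turbo"}


def pick_model_name(size: str, english_only: bool = False) -> str:
    """Hypothetical helper: map a size to a checkpoint name such as "base.en"."""
    if size not in ALL_SIZES:
        raise ValueError(f"unknown model size: {size!r}")
    if english_only and size in ENGLISH_ONLY_SIZES:
        return f"{size}.en"
    # large and turbo are multilingual only, so no ".en" suffix is added.
    return size


def transcribe_file(path: str, size: str = "turbo", english_only: bool = False) -> str:
    """Load a checkpoint and return the transcribed text for one audio file."""
    # Imported lazily so the helper above works without the package installed.
    import whisper

    model = whisper.load_model(pick_model_name(size, english_only))
    result = model.transcribe(path)
    return result["text"]


if __name__ == "__main__":
    print(pick_model_name("base", english_only=True))
    print(pick_model_name("turbo", english_only=True))
```

Calling `transcribe_file("audio.mp3")` downloads the chosen checkpoint on first use; the file name here is a placeholder.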

Maintenance & Community

Licensing & Compatibility

  • Released under the MIT License, permitting commercial use and integration with closed-source projects.

Limitations & Caveats

Model performance, particularly Word Error Rate (WER), varies significantly across languages. The README does not detail specific hardware requirements beyond VRAM estimates for models.

Health Check

  • Last commit: 1 week ago
  • Responsiveness: Inactive
  • Pull requests (30d): 6
  • Issues (30d): 0
  • Star history: 1,569 stars in the last 30 days
