whisper by openai

Speech recognition model for multilingual transcription/translation

created 2 years ago
85,828 stars

Top 0.1% on sourcepulse

Project Summary

Whisper is a robust, general-purpose speech recognition model developed by OpenAI. It excels at multilingual transcription, speech translation, and language identification, serving researchers and developers needing high-accuracy audio processing. Its key benefit is a single, unified model that replaces complex, multi-stage traditional pipelines.

How It Works

Whisper employs a Transformer sequence-to-sequence architecture trained on a diverse, large-scale dataset. It unifies various speech tasks (recognition, translation, language ID, voice activity detection) by treating them as token prediction problems for the decoder. Special tokens act as task specifiers, enabling multitask learning within a single model.
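
Here is a sketch of how those task tokens surface in practice, adapted from the README's lower-level example for the pip-installable `openai-whisper` package. The `speech.mp3` path and the translate-from-Japanese setup are illustrative assumptions:

```python
import whisper

# A minimal sketch using the `openai-whisper` package;
# "speech.mp3" is a placeholder path.
model = whisper.load_model("small")

# Load audio, pad/trim it to 30 seconds, and compute the log-Mel spectrogram.
audio = whisper.load_audio("speech.mp3")
audio = whisper.pad_or_trim(audio)
mel = whisper.log_mel_spectrogram(audio, n_mels=model.dims.n_mels).to(model.device)

# Language identification is just another decoder prediction task.
_, probs = model.detect_language(mel)
print(f"Detected language: {max(probs, key=probs.get)}")

# The task and language choices become special tokens prefixed to the decoder
# prompt; task="translate" requests English output instead of a transcript.
options = whisper.DecodingOptions(task="translate", language="ja")
result = whisper.decode(model, mel, options)
print(result.text)
```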

Quick Start & Requirements
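
The README keeps setup brief: install with `pip install -U openai-whisper` (Python with a recent PyTorch), and have the `ffmpeg` command-line tool on your PATH, since Whisper shells out to it for audio decoding. A minimal transcription then follows the README's Python example (`audio.mp3` is a placeholder path):

```python
import whisper

# First call downloads the checkpoint; pick a size that fits your VRAM.
model = whisper.load_model("turbo")

# Transcribe any file ffmpeg can decode; "audio.mp3" is a placeholder path.
result = model.transcribe("audio.mp3")
print(result["text"])
```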

Highlighted Details

  • Offers six model sizes (tiny to large, plus turbo) with varying VRAM and speed trade-offs.
  • Supports English-only and multilingual transcription, with dedicated .en models for improved English performance (see the sketch after this list).
  • Includes a turbo model optimized for speed with minimal accuracy loss.
  • Benchmarks (WER/CER) available for multiple languages and datasets in the paper.
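
Here is a short sketch of those trade-offs in code, using the published checkpoint names (`tiny` through `large`, plus `turbo` and the English-only `*.en` variants); `lecture.wav` is a placeholder path:

```python
import whisper

# English-only checkpoints such as "base.en" tend to perform better on
# English audio, especially at the smaller sizes, per the README.
model = whisper.load_model("base.en")

# fp16=False keeps inference in full precision, avoiding the half-precision
# warning when running on CPU; "lecture.wav" is a placeholder path.
result = model.transcribe("lecture.wav", fp16=False)
print(result["text"])
```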

Maintenance & Community
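
Activity is summarized in the Health Check below: the last commit landed about a month ago, with 7 pull requests and no new issues opened in the past 30 days, though overall responsiveness is currently rated Inactive.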

Licensing & Compatibility

  • Released under the MIT License, permitting commercial use and integration with closed-source projects.

Limitations & Caveats

Model performance, particularly Word Error Rate (WER), varies significantly across languages. The README does not detail hardware requirements beyond per-model VRAM estimates.

Health Check

  • Last commit: 1 month ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 7
  • Issues (30d): 0

Star History

5,287 stars in the last 90 days
