whisper by openai

Speech recognition model for multilingual transcription/translation

Created 3 years ago

92,973 stars

Top 0.1% on SourcePulse


45 Experts Love This Project

Boris Cherny

Creator of Claude Code; MTS at Anthropic

Andrej Karpathy

Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n

Gabriel Almeida

Cofounder of Langflow

Yineng Zhang

Inference Lead at SGLang; Research Scientist at Together AI

and 41 more!

Project Summary

Whisper is a general-purpose speech recognition model developed by OpenAI. It handles multilingual transcription, speech translation into English, and spoken language identification, and is aimed at researchers and developers who need accurate audio processing. Its main benefit is that a single model replaces the multiple stages of a traditional speech-processing pipeline.

How It Works

Whisper is a Transformer sequence-to-sequence model trained on a large, diverse audio dataset. It unifies several speech tasks (multilingual recognition, translation, spoken language identification, voice activity detection) by representing them jointly as a sequence of tokens for the decoder to predict. Special tokens act as task specifiers or classification targets, so one model can be trained on all tasks at once.
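To make the task-specifier idea concrete, here is a minimal sketch of how a decoder prompt could be assembled from special tokens. The token strings follow Whisper's published vocabulary, but the `build_decoder_prompt` helper is hypothetical and works on plain strings; it is an illustration of the idea, not the library's tokenizer code.

```python
from typing import List, Optional


def build_decoder_prompt(
    language: Optional[str],
    task: str = "transcribe",
    timestamps: bool = False,
) -> List[str]:
    """Return the special tokens that would precede the transcript text.

    Hypothetical helper for illustration only: it manipulates strings,
    whereas the real model works on integer token ids.
    """
    if task not in ("transcribe", "translate"):
        raise ValueError(f"unknown task: {task!r}")

    prompt = ["<|startoftranscript|>"]

    # With no language given, the prompt stops here and the language token
    # is the next thing the decoder predicts (language identification).
    if language is None:
        return prompt

    prompt.append(f"<|{language}|>")   # e.g. <|en|>, <|fr|>
    prompt.append(f"<|{task}|>")       # task specifier
    if not timestamps:
        prompt.append("<|notimestamps|>")
    return prompt


print(build_decoder_prompt("fr", task="translate"))
```

Changing only the task token switches the same model between transcribing French and translating it into English, which is what lets one decoder cover several tasks.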

Quick Start & Requirements

Install via pip: pip install -U openai-whisper
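Expected to be compatible with Python 3.8-3.11 and recent PyTorch versions; the ffmpeg command-line tool must be installed separately. Rust may be needed to build tiktoken when no pre-built wheel is available for the platform.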
Official docs: https://github.com/openai/whisper
Colab example: https://colab.research.google.com/github/openai/whisper/blob/main/examples/colab.ipynb
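After installation, a first transcription might look like the sketch below. `whisper.load_model` and `model.transcribe` are the package's documented Python entry points; the `check_environment` and `transcribe` wrappers, the default model name, and the `audio.mp3` path mentioned afterwards are assumptions made for illustration.

```python
import importlib.util
import shutil
import sys
from typing import List


def check_environment() -> List[str]:
    """Return a list of human-readable problems; empty means ready to go."""
    problems = []
    if not ((3, 8) <= sys.version_info[:2] <= (3, 11)):
        problems.append("Python version is outside the 3.8-3.11 range listed above")
    if shutil.which("ffmpeg") is None:
        problems.append("ffmpeg was not found on PATH")
    if importlib.util.find_spec("whisper") is None:
        problems.append("openai-whisper is not installed")
    return problems


def transcribe(path: str, model_name: str = "turbo") -> str:
    """Load a model by name and return the transcript text for one file."""
    import whisper  # imported lazily so check_environment() works without it

    model = whisper.load_model(model_name)  # downloads weights on first use
    result = model.transcribe(path)         # dict with "text", "segments", "language"
    return result["text"]


if __name__ == "__main__":
    for problem in check_environment():
        print("warning:", problem)
```

With the prerequisites in place, `print(transcribe("audio.mp3"))` would print the recognized text for that (hypothetical) file.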

Highlighted Details

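Offers six model sizes (tiny, base, small, medium, large, and turbo) with different VRAM and speed trade-offs.
Supports English-only and multilingual transcription; the tiny, base, small, and medium sizes also come in .en variants, which tend to perform better on English, especially at the tiny and base sizes.
Includes a turbo model, an optimized version of large-v3 that is faster with minimal accuracy loss. Turbo is not trained for translation, so translating speech into English needs one of the other multilingual models.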
Benchmarks (WER/CER) available for multiple languages and datasets in the paper.
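The size trade-off can be sketched as a small lookup. The parameter counts, approximate VRAM figures, and relative speeds below are taken from the upstream README's model table as best recalled and may change between releases; the `largest_model_that_fits` helper and its selection rule (most parameters within a VRAM budget) are assumptions for illustration, not a recommendation from the project.

```python
from typing import NamedTuple, Optional


class ModelInfo(NamedTuple):
    name: str
    params_millions: int
    vram_gb: int            # approximate required VRAM
    relative_speed: int     # approximate speed relative to "large"
    has_english_only: bool  # whether a ".en" variant exists


# Approximate figures; treat them as a snapshot, not a guarantee.
MODELS = [
    ModelInfo("tiny", 39, 1, 10, True),
    ModelInfo("base", 74, 1, 7, True),
    ModelInfo("small", 244, 2, 4, True),
    ModelInfo("medium", 769, 5, 2, True),
    ModelInfo("large", 1550, 10, 1, False),
    ModelInfo("turbo", 809, 6, 8, False),
]


def largest_model_that_fits(vram_gb: float, english_only: bool = False) -> Optional[str]:
    """Pick the model with the most parameters that fits the VRAM budget.

    With english_only=True, only sizes that have a ".en" variant qualify.
    Returns None when nothing fits.
    """
    candidates = [
        m for m in MODELS
        if m.vram_gb <= vram_gb and (m.has_english_only or not english_only)
    ]
    if not candidates:
        return None
    best = max(candidates, key=lambda m: m.params_millions)
    return best.name + ".en" if english_only else best.name


print(largest_model_that_fits(6))                     # turbo
print(largest_model_that_fits(6, english_only=True))  # medium.en
```

A different rule, such as maximizing `relative_speed` instead of parameter count, would favor the smaller models on the same budget.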

Maintenance & Community

Developed by OpenAI.
Community discussions available at https://github.com/openai/whisper/discussions.

Licensing & Compatibility

Released under the MIT License, permitting commercial use and integration with closed-source projects.

Limitations & Caveats

Model performance, particularly Word Error Rate (WER), varies significantly across languages. The README does not detail specific hardware requirements beyond VRAM estimates for models.
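For readers unfamiliar with the metric, WER is the number of word substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the number of reference words. The sketch below computes it with a standard word-level edit distance; the `word_error_rate` function is written for illustration and applies no text normalization, so its numbers will not match published benchmarks that normalize casing and punctuation first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far (minus the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # turning i reference words into nothing costs i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One dropped word out of six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Because the denominator is the reference length, WER can exceed 1.0 when the hypothesis contains many inserted words.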

Health Check

Last Commit

3 weeks ago

Responsiveness

Inactive

Star History

1,229 stars in the last 30 days