Whisper  by Const-me

GPGPU inference for OpenAI's Whisper ASR model

created 2 years ago
9,555 stars

Top 5.3% on sourcepulse

GitHubView on GitHub
Project Summary

This project provides a high-performance, vendor-agnostic GPGPU inference engine for OpenAI's Whisper ASR model, specifically targeting Windows users. It offers a significantly faster and more lightweight alternative to Python-based implementations for transcribing audio files or live microphone input.

How It Works

The core of the project is a C++ implementation leveraging DirectCompute (Direct3D 11 compute shaders) for GPU acceleration. This approach avoids heavy runtime dependencies like PyTorch and CUDA, resulting in a much smaller footprint. It utilizes mixed F16/F32 precision and includes a built-in profiler for shader execution times. Media Foundation is used for broad audio format and capture device support.

Quick Start & Requirements

  • Download WhisperDesktop.zip from the Releases section and run WhisperDesktop.exe.
  • Requires a Direct3D 11.0 capable GPU (any hardware GPU from 2011 onwards).
  • CPU requires AVX1 and F16C support.
  • Tested on Windows 10 (should work on 8.1+).
  • Official quick-start and download links are available in the README.

Highlighted Details

  • Achieves 2.2x to 10.6x speedup over PyTorch/CUDA implementations on various GPUs.
  • Minimal runtime dependencies (431 KB vs. 9.63 GB for PyTorch/CUDA).
  • Supports live audio capture with voice activity detection.
  • Includes a COM-style API with an idiomatic C# wrapper available via NuGet.
  • PowerShell 5.1 scripting support introduced in v1.10.
  • RenderDoc GPU debugger integration is included.

Maintenance & Community

This appears to be a personal hobby project, with no specific mention of contributors, sponsorships, or community channels like Discord/Slack.

Licensing & Compatibility

The project is licensed under the MIT License, permitting commercial use and closed-source linking.

Limitations & Caveats

Automatic language detection is not implemented. Real-time audio capture exhibits high latency (5-10 seconds) due to voice detection and audio chunking. Performance on discrete AMD or integrated Intel GPUs may not be fully optimized. The project is provided "as is" without warranty.

Health Check
Last commit

1 year ago

Responsiveness

Inactive

Pull Requests (30d)
1
Issues (30d)
1
Star History
386 stars in the last 90 days

Explore Similar Projects

Feedback? Help us improve.